http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Hclam&feedformat=atomstatwiki - User contributions [US]2024-03-28T22:15:57ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=7369stat841f102010-10-25T21:43:08Z<p>Hclam: /* Perceptron */</p>
<hr />
<div>==[[Proposal Fall 2010]] ==<br />
==[[statf10841Scribe|Editor sign up]] ==<br />
{{Cleanup|date=October 8 2010|reason=Provide a summary for each topic here.}}<br />
== Summary ==<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features that best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
=== Principal Component Analysis ===<br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA finds a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data and projects the data onto these directions so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of the data whilst capturing as much of the variation in the data as possible.<br />
<br />
==[[f10_Stat841_digest |Digest ]] ==<br />
<br />
== ''' Reference Textbook''' ==<br />
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
== ''' Classification - September 21, 2010''' ==<br />
<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction] (feature extraction or manifold learning). Note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.<br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found [http://www.e-knowledge.ca/quotes.php?topic=Knowledge here].<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers <br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d.)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which can take a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
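The definition of <math>\,\hat{L}_{n}</math> translates almost directly into code. The following is a minimal Python sketch; the function name <code>empirical_error_rate</code>, the rule <code>toy_rule</code>, and the toy fruit data are hypothetical and are used only for illustration.

```python
def empirical_error_rate(h, inputs, labels):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    if not inputs:
        raise ValueError("need at least one training pair")
    # The generator plays the role of the indicator I(h(X_i) != Y_i).
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(inputs)


# Hypothetical rule: call a fruit an "orange" when its diameter exceeds 7 cm.
def toy_rule(diameter):
    return "orange" if diameter > 7.0 else "apple"


# Made-up training data (diameter in cm, true label).
diameters = [6.0, 6.5, 7.5, 8.0, 7.2]
labels = ["apple", "apple", "orange", "orange", "apple"]

rate = empirical_error_rate(toy_rule, diameters, labels)
print(rate)  # one of the five fruits (the 7.2 cm apple) is misclassified
```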
<br />
<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify any new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
<br />
In practice, the empirical error rate is used to estimate the true error rate, whose value cannot be known exactly because the parameter values of the underlying process cannot be known but can only be estimated using available data. For a classification rule <math>\,h</math> that is fixed independently of the data on which it is evaluated, the empirical error rate is, as mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], an unbiased estimator of the true error rate. When <math>\,h</math> is trained on those same data, however, the training error rate tends to underestimate the true error rate.<br />
<br />
=== Bayes Classifier ===<br />
<br />
{{Cleanup|date=October 14 2010|reason=In response to the previous tag: The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function <math>\,f_y(x)</math> in the equation:<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
The simpler form of the likelihood function seen in the naive Bayes is:<br />
:<math><br />
\begin{align}<br />
f_y(x) = P(X=x|Y=y) = {\prod_{i=1}^{n} P(X_{i}=x_{i}|Y=y)}<br />
\end{align}<br />
</math><br />
The Bayes classifier taught in class was not the naive Bayes classifier. Perhaps a comment should be made about the naive Bayes classifier in the body of the text}}<br />
<br />
<br />
A Bayes classifier is a probabilistic classifier based on applying Bayes' Theorem. Its simplest special case, the [http://en.wikipedia.org/wiki/Naive_Bayes_classifier naive Bayes classifier], additionally makes strong (naive) independence assumptions, and a more descriptive term for its underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests][2].<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
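Under the independence assumption described above, the class-conditional likelihood of an input factorizes into a product of per-feature probabilities. The following Python sketch illustrates this for binary features; the function name <code>naive_likelihood</code> and the per-feature probabilities for the class "apple" are hypothetical values chosen for illustration, not estimates from any real data.

```python
from math import prod


def naive_likelihood(x, feature_probs):
    """P(X = x | Y = y) for binary features under the naive independence assumption.

    feature_probs[j] is assumed to hold P(X_j = 1 | Y = y) for the class y of interest.
    """
    if len(x) != len(feature_probs):
        raise ValueError("x and feature_probs must have the same length")
    # Each feature contributes p if it is present and (1 - p) if it is absent.
    return prod(p if xj == 1 else 1.0 - p for xj, p in zip(x, feature_probs))


# Hypothetical values of P(red | apple), P(round | apple), P(diameter > 4" | apple).
probs_apple = [0.9, 0.8, 0.5]

# Likelihood of a fruit that is red, round, and not larger than 4" given "apple".
lik = naive_likelihood((1, 1, 0), probs_apple)
print(lik)
```

Only one number per feature and per class has to be estimated here, which is why comparatively little training data is needed.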
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class <math> y \in \mathcal{Y} </math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
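The derivation above can be sketched in a few lines of Python. The function name <code>posteriors</code> and the numerical likelihoods and priors below are hypothetical and serve only to illustrate how the evidence normalizes the products of likelihood and prior.

```python
def posteriors(likelihoods, priors):
    """Return P(Y = y | X = x) for every class y using the Bayes formula.

    likelihoods[y] is P(X = x | Y = y) and priors[y] is P(Y = y).
    """
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())  # P(X = x), the denominator of the formula
    if evidence == 0:
        raise ValueError("the input has zero probability under every class")
    return {y: joint[y] / evidence for y in joint}


# Hypothetical likelihoods of one observed fruit under each class, with equal priors.
post = posteriors({"apple": 0.30, "orange": 0.10}, {"apple": 0.5, "orange": 0.5})
print(post)
```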
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h^*)</math> essentially combines the trained model and the decision function <math>\,h^*</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification, i.e., for any classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule.<br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. Estimating all of these probability functions (the likelihood, the prior probability, and the evidence) is also computationally expensive, which makes some other classifiers more favorable than the Bayes classifier.<br />
<br />
A detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h^*</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h^*)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
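For discrete feature vectors, approach 3 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the class-conditional probabilities are estimated by plain relative frequencies, inputs never seen in training are given <math>\,\hat r(x) = 0.5</math> by convention, and the function name <code>fit_plugin_classifier</code> and the toy data are hypothetical.

```python
from collections import Counter


def fit_plugin_classifier(X, Y):
    """Estimate P(X=x|Y=0), P(X=x|Y=1) and P(Y=1) from data; return (r_hat, h)."""
    n = len(Y)
    pi1 = sum(Y) / n  # estimate of P(Y = 1) as the mean of the labels
    counts = {0: Counter(), 1: Counter()}
    for x, y in zip(X, Y):
        counts[y][tuple(x)] += 1
    totals = {y: sum(c.values()) for y, c in counts.items()}

    def r_hat(x):
        x = tuple(x)
        # Relative-frequency estimates of the two class-conditional probabilities.
        f1 = counts[1][x] / totals[1] if totals[1] else 0.0
        f0 = counts[0][x] / totals[0] if totals[0] else 0.0
        num = f1 * pi1
        den = num + f0 * (1.0 - pi1)
        # Assumption: an input never seen in training gets 0.5 (no evidence either way).
        return num / den if den > 0 else 0.5

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h


# Made-up training data with two binary features.
X = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0), (1, 1), (1, 1)]
Y = [1, 1, 1, 0, 0, 0, 0, 1]
r_hat, h = fit_plugin_classifier(X, Y)
print(r_hat((1, 1)), h((1, 1)), h((0, 0)))
```

For continuous features, the relative frequencies would have to be replaced by a density estimate, for example the Gaussian model assumed by LDA and QDA.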
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
'''Theorem'''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
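Because the evidence <math>\,P(X=x)</math> is the same for every class, the maximization in the theorem only needs the products <math>\,f_i(x)\pi_i</math>. The following Python sketch shows this; the function name <code>classify</code> and the likelihood and prior values are hypothetical.

```python
def classify(likelihoods, priors):
    """Return the class i that maximizes f_i(x) * pi_i.

    This is the same class that maximizes P(Y = i | X = x), since the evidence
    P(X = x) is a common positive factor. Ties go to the first class encountered.
    """
    scores = {i: likelihoods[i] * priors[i] for i in priors}
    return max(scores, key=scores.get)


# Hypothetical three-class example: class 2 has the largest likelihood,
# but class 3 has the largest product of likelihood and prior.
label = classify({1: 0.10, 2: 0.40, 3: 0.30}, {1: 0.5, 2: 0.2, 3: 0.3})
print(label)
```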
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course. ''Note: these are the known y values in the training data.'' <br />
<br />
These known data are summarized in the following tables:<br />
<br />
:[[File:裁剪.jpg]]<br />
{{Cleanup|date=September 2010|reason=this graph is not complete, the reason is that it should be in consistent with the computation below.}}<br />
<br />
For each student, his/her feature values are <math>\, x = (G, M, H) </math> and his/her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}<\frac{1}{2}.</math><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
<br />
The Bayesian view of probability states that any event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how believable the occurrence of event E is prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
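The long-run behaviour of the ratio <math>\frac{n_x}{n_t}</math> can be illustrated by simulation. The Python sketch below repeats independent trials of an event whose intrinsic probability is set, purely as an assumption for the illustration, to 0.3; the function name <code>relative_frequency</code> and the fixed random seed are likewise hypothetical choices.

```python
import random


def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials independent trials of an event with probability p
    and return the relative frequency n_x / n_t of its occurrence."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p)
    return n_x / n_trials


# The ratio n_x / n_t for an increasing number of trials.
estimates = {n: relative_frequency(0.3, n) for n in (10, 1000, 100000)}
for n, estimate in estimates.items():
    print(n, estimate)
```

With few trials the ratio can be far from 0.3, whereas with many trials it typically settles close to it.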
<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the hyperplane the input lies in. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the priors and the class-conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that both classes have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distributions] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> to <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of <math>\,D(h^*)</math> is as follows:<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
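It is easy to see that, under LDA, the Bayes classifier's decision boundary <math>\,D(h^*)</math> has the form <math>\,a^\top x+b=0</math>, with <math>\,a=\Sigma^{-1}(\mu_1-\mu_0)</math> and <math>\,b=\log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left(\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0\right)</math>, and so it is linear in <math>\,x</math>. This is where the word ''linear'' in linear discriminant analysis comes from.<br />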
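<br />
As a minimal sketch of this result, the following Matlab lines compute the coefficients of the linear boundary from the last line of the derivation and use them to classify one point. The numerical values of <code>mu0</code>, <code>mu1</code>, <code>Sigma</code>, <code>pi0</code>, <code>pi1</code> and <code>x</code> are made up purely for illustration and do not come from any data set used in the course.<br />
<br />
 mu0 = [0; 0];  mu1 = [2; 1];          % hypothetical class means<br />
 Sigma = [1 0.3; 0.3 2];               % hypothetical shared covariance matrix<br />
 pi0 = 0.4;  pi1 = 0.6;                % hypothetical priors<br />
 a = Sigma \ (mu1 - mu0);              % a = Sigma^(-1) * (mu1 - mu0)<br />
 b = log(pi1/pi0) - 0.5*(mu1'*(Sigma\mu1) - mu0'*(Sigma\mu0));<br />
 x = [1.5; 1];                         % hypothetical point to classify<br />
 s = a'*x + b;                         % s equals log(f_1(x)*pi1) - log(f_0(x)*pi0)<br />
 label = double(s > 0)                 % 1 if x falls on the class 1 side, 0 otherwise<br />
<br />
The backslash operator solves the linear system directly, which avoids forming the inverse of <code>Sigma</code> explicitly.<br />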
<br />
<br />
LDA under the two-class case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. Doing so, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0</math>. In addition, if the two classes <math>\,m </math> and <math>\,n</math> have the same number of data points, so that their estimated priors are equal, then the resulting decision boundary passes through the point exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.<br />
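<br />
To see why the boundary goes through the midpoint, here is a short worked derivation under the assumption of equal priors, <math>\,\pi_m=\pi_n</math> (which is what equal class sizes give when the priors are estimated by class proportions). The log term vanishes, and because <math>\,\Sigma^{-1}</math> is symmetric the boundary can be rewritten as:<br />
<br />
:<math>\,x^\top\Sigma^{-1}(\mu_m-\mu_n)=\frac{1}{2}\left(\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n\right)=\frac{1}{2}(\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n)</math><br />
:<math>\,\Rightarrow \left(x-\frac{\mu_m+\mu_n}{2}\right)^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math><br />
<br />
so the point <math>\,x=\frac{\mu_m+\mu_n}{2}</math> satisfies the boundary equation.<br />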
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, it nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
According to [http://www.lsv.uni-saarland.de/Vorlesung/Digital_Signal_Processing/Summer06/dsp06_chap9.pdf this link], some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that the data in each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
<br />
The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-classes case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows: <br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math> (by cancellation)<br />
:<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)</math> (by taking the log of both sides)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0</math> (by expanding out)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0</math> <br />
<br />
It is easy to see that, under QDA, the decision boundary <math>\,D(h^*)</math> has the form <math>\,x^\top Ax+b^\top x+c=0</math>, where <math>\,A</math> is a matrix, <math>\,b</math> is a vector and <math>\,c</math> is a scalar, and so it is quadratic in <math>\,x</math>. This is where the word ''quadratic'' in quadratic discriminant analysis comes from.<br />
<br />
As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0</math>.<br />
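<br />
As a rough illustration, the sketch below reads the quadratic, linear and constant terms off the last line of the two-class QDA derivation and evaluates the resulting function at one point. All numerical values are hypothetical, and the names <code>A</code>, <code>b</code>, <code>c</code> and <code>g</code> are introduced here only for the example.<br />
<br />
 mu0 = [0; 0];  Sigma0 = [1 0; 0 1];        % hypothetical parameters of class 0<br />
 mu1 = [2; 1];  Sigma1 = [2 0.5; 0.5 1];    % hypothetical parameters of class 1<br />
 pi0 = 0.5;  pi1 = 0.5;                     % hypothetical priors<br />
 A = -0.5*(inv(Sigma1) - inv(Sigma0));      % quadratic term<br />
 b = Sigma1\mu1 - Sigma0\mu0;               % linear term<br />
 c = log(pi1/pi0) - 0.5*log(det(Sigma1)/det(Sigma0)) ...<br />
     - 0.5*(mu1'*(Sigma1\mu1) - mu0'*(Sigma0\mu0));   % constant term<br />
 g = @(x) x'*A*x + b'*x + c;                % g(x) = 0 is the decision boundary<br />
 g([1; 0.5])                                % positive favours class 1, negative favours class 0<br />
<br />
If <code>Sigma1</code> and <code>Sigma0</code> are set equal, <code>A</code> becomes the zero matrix and the boundary reduces to the linear one obtained under LDA.<br />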
<br />
===Summarizing LDA and QDA===<br />
<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian for every class <math>\,k</math>, then the Bayes classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
* In the case of LDA, which assumes that a common covariance matrix is shared by all classes, <math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is linear in <math>\,x</math>.<br />
<br />
* In the case of QDA, which assumes that each class has its own covariance matrix, <math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is quadratic in <math>\,x</math>.<br />
<br />
<br />
'''Note:''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]<br />
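<br />
The rule in the theorem can be sketched in a few lines of Matlab. The example below evaluates the QDA form of <math>\,\delta_k(x)</math> for three classes and picks the largest; the means, covariance matrices, priors and the test point are all hypothetical values chosen for illustration.<br />
<br />
 mu = {[0; 0], [2; 1], [-1; 3]};                       % hypothetical class means<br />
 Sigma = {eye(2), [2 0.5; 0.5 1], [1 -0.2; -0.2 1]};   % hypothetical class covariance matrices<br />
 prior = [0.3 0.3 0.4];                                % hypothetical priors<br />
 x = [1; 1];                                           % point to classify<br />
 K = numel(mu);<br />
 delta = zeros(K,1);<br />
 for k = 1:K<br />
     r = x - mu{k};<br />
     delta(k) = -0.5*log(det(Sigma{k})) - 0.5*(r'*(Sigma{k}\r)) + log(prior(k));<br />
 end<br />
 [dmax, h] = max(delta);                               % h is the predicted class<br />
<br />
Note that <code>max</code> returns only the first maximizing index, so in the (unlikely) event of an exact tie this sketch reports one class rather than the whole set.<br />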
<br />
===In practice===<br />
In practice, the true values of the priors, means and covariance matrices are unknown, so we replace <math>\,\pi_k,\mu_k,\Sigma_k</math> with their maximum likelihood estimates from the sample, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we need a common covariance matrix <math>\Sigma</math> (as in LDA), it is estimated as a weighted average of the covariance estimates of the individual classes:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{K}(n_r\hat{\Sigma}_r)}{n-K} </math><br />
<br />
where <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\hat{\Sigma}_r</math> is the estimated covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,K</math> is the number of classes. Dividing by <math>\,n-K</math> gives the usual unbiased pooled estimate; dividing by <math>\,n</math> instead gives the maximum likelihood estimate.<br />
<br />
See the details about the [http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices estimation of covariance matrices].<br />
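<br />
As a hedged sketch, the estimates above could be computed in Matlab as follows. The data matrix <code>X</code> (rows are observations) and the label vector <code>y</code> are randomly generated stand-ins, and all variable names are chosen for this example only.<br />
<br />
 X = [randn(50,2); randn(60,2) + 2];        % hypothetical data: rows are observations<br />
 y = [ones(50,1); 2*ones(60,1)];            % hypothetical labels in {1,...,K}<br />
 [n, d] = size(X);<br />
 K = max(y);<br />
 pi_hat = zeros(K,1);  mu_hat = zeros(K,d);  Sigma_hat = cell(K,1);<br />
 Sigma_pooled = zeros(d);<br />
 for k = 1:K<br />
     Xk = X(y == k, :);                     % the data points of class k<br />
     nk = size(Xk, 1);<br />
     pi_hat(k) = nk / n;                    % estimated prior<br />
     mu_hat(k,:) = mean(Xk, 1);             % estimated mean<br />
     Sigma_hat{k} = cov(Xk, 1);             % covariance normalised by n_k (the ML estimate)<br />
     Sigma_pooled = Sigma_pooled + nk * Sigma_hat{k};<br />
 end<br />
 Sigma_pooled = Sigma_pooled / (n - K);     % common covariance matrix for LDA<br />
<br />
The second argument of <code>cov</code> selects normalisation by the number of observations rather than by one less than that number.<br />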
<br />
===Computation For QDA And LDA===<br />
<br />
First, let us consider QDA, and examine each of the following two cases.<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
<math>\, \Sigma_k = I </math> for every class <math>\,k</math> implies that our data is spherical. This means that the data of each class <math>\,k</math> is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,-\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center, halve it, and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. When the priors are equal, the class whose center is closest to <math>\,x</math> maximizes <math>\,\delta_k</math>; otherwise the log prior shifts the comparison in favour of the more probable classes. According to the theorem, we then assign the point to the class <math>\,k</math> with the largest <math>\,\delta_k</math>. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top </math> (In general, when a matrix is decomposed as <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. If <math>\, X</math> is symmetric and positive semi-definite, we can take <math>\, U=V</math>. This applies to <math>\, \Sigma_k </math>, because it is a covariance matrix.) Assuming <math>\,\Sigma_k</math> is invertible, its inverse is<br />
<br />
<math> \, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top </math> (since <math>\,U_k</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S_k^{-\frac{1}{2}}U_k^\top x </math> and <math>\, S_k^{-\frac{1}{2}}U_k^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>.<br />
<br />
A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math> where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math> and <math>\,\mu_k^*</math>, treating them as in Case 1 above.<br />
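<br />
The identity derived above can be checked numerically. The following sketch uses a made-up covariance matrix, mean and data point for a single class, and compares the original quadratic form with the squared Euclidean distance after the transformation.<br />
<br />
 Sigma_k = [2 0.5; 0.5 1];                        % hypothetical covariance matrix of class k<br />
 mu_k = [1; 2];                                   % hypothetical class mean<br />
 x = [0.5; 3];                                    % hypothetical data point<br />
 [U, S, V] = svd(Sigma_k);                        % Sigma_k = U*S*U' since Sigma_k is symmetric positive definite<br />
 W = diag(1 ./ sqrt(diag(S))) * U';               % W = S^(-1/2) * U'<br />
 x_star = W * x;                                  % transformed data point<br />
 mu_star = W * mu_k;                              % transformed center<br />
 d1 = (x - mu_k)' * (Sigma_k \ (x - mu_k));       % original quadratic form<br />
 d2 = (x_star - mu_star)' * (x_star - mu_star);   % squared Euclidean distance after the transformation<br />
 abs(d1 - d2)                                     % should be zero up to rounding error<br />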
<br />
Note, however, that in the general case the classes do not share the same covariance matrix, so two points need care:<br />
<br />
* The transformation <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math> depends on the class <math>\,k</math>. A given data point <math>\,x</math> must therefore be transformed once for each class, using that class's own <math>\,S_k</math> and <math>\,U_k</math>, and its distance to <math>\,\mu_k^*</math> is measured in that class's transformed space, where the variance of class <math>\,k</math> is <math>\,I</math>. There is no need to assume ahead of time which class the point belongs to: we simply compute <math>\,\delta_k</math> for every class and compare the results. (This is the approach used for QDA in assignment 1, where the three covariance matrices are different.)<br />
* Unlike in Case 1, the term <math>\,-\frac{1}{2}\log(|\Sigma_k|)</math> is in general different for each class and does not vanish, so it must be computed for each class and included in <math>\,\delta_k</math>.<br />
<br />
What does not work is to pick the transformation of one class, say class A, and apply it to all of the data: doing so would implicitly assume that every data point belongs to class A. A single transformation for all data points is valid only when all classes have the same shape, i.e. the same covariance matrix <math>\,\Sigma</math>, which is the LDA assumption. In that case the term <math>\,\log(|\Sigma|)</math> is also the same for every class and can be dropped from the comparison.<br />
<br />
In summary, to apply QDA on a data set <math>\,X</math>, in the general case where <math>\, \Sigma_k \ne I </math> for each class <math>\,k</math>, one can proceed as follows:<br />
<br />
:: Step 1: For each class <math>\,k</math>, estimate <math>\,\mu_k</math>, <math>\,\Sigma_k</math> and <math>\,\pi_k</math> from the data <math>\,X_k</math> of that class, and apply singular value decomposition on <math>\,\Sigma_k</math> to obtain <math>\,S_k</math> and <math>\,U_k</math>.<br />
<br />
:: Step 2: For each class <math>\,k</math>, transform every data point <math>\,x</math> to <math>\,x^*_k = S_k^{-\frac{1}{2}}U_k^\top x</math>, and transform the center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math> and each class <math>\,k</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*_k</math> and the transformed center <math>\,\mu^*_k</math>, compute <math>\,\delta_k = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}\|x^*_k-\mu^*_k\|^2 + \log(\pi_k)</math>, and assign <math>\,x</math> to the class for which <math>\,\delta_k</math> is the largest.<br />
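<br />
The steps above can be sketched in Matlab as follows. This is only an illustration on randomly generated data, not the code used in class; it decomposes the estimated covariance matrix of each class and evaluates the full <math>\,\delta_k</math> of the theorem, so the terms <math>\,-\frac{1}{2}\log(|\Sigma_k|)</math> and <math>\,\log(\pi_k)</math> are added to the squared-distance term. All variable names are chosen for this example.<br />
<br />
 X = [randn(50,2); randn(60,2)*[2 0; 0 0.5] + 3];   % hypothetical data: rows are observations<br />
 y = [ones(50,1); 2*ones(60,1)];                    % hypothetical labels<br />
 [n, d] = size(X);  K = max(y);<br />
 delta = zeros(n, K);<br />
 for k = 1:K<br />
     Xk = X(y == k, :);<br />
     mu_k = mean(Xk, 1);<br />
     [U, S, V] = svd(cov(Xk, 1));                   % Step 1: decompose the covariance of class k<br />
     W = diag(1 ./ sqrt(diag(S))) * U';             % W = S_k^(-1/2) * U_k'<br />
     Z = (X - repmat(mu_k, n, 1)) * W';             % Step 2: each row is x* - mu_k* for one data point<br />
     dist2 = sum(Z.^2, 2);                          % Step 3: squared Euclidean distances<br />
     delta(:,k) = -0.5*sum(log(diag(S))) - 0.5*dist2 + log(size(Xk,1)/n);<br />
 end<br />
 [dmax, class_hat] = max(delta, [], 2);             % class with the largest delta_k for each point<br />
 mean(class_hat == y)                               % fraction of training points classified correctly<br />
<br />
Here <code>sum(log(diag(S)))</code> equals <math>\,\log(|\Sigma_k|)</math>, because the determinant of a covariance matrix is the product of its singular values.<br />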
<br />
<br />
Now, let us consider LDA. <br />
Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common covariance matrix across the classes, so we perform the singular value decomposition once, as opposed to once per class.<br />
<br />
To apply LDA on a data set <math>\,X</math>, one can proceed as follows:<br />
<br />
:: Step 1: Estimate the class centers <math>\,\mu_k</math>, the priors <math>\,\pi_k</math> and the common covariance matrix <math>\,\Sigma</math>, and apply singular value decomposition on <math>\,\Sigma</math> to obtain <math>\,S</math> and <math>\,U</math>.<br />
<br />
:: Step 2: For each <math>\,x \in X</math>, transform <math>\,x</math> to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, and transform each center <math>\,\mu</math> to <math>\,\mu^* = S^{-\frac{1}{2}}U^\top \mu</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu^*</math> of each class, and assign <math>\,x</math> to the class for which half of this squared distance minus the log prior <math>\,\log(\pi_k)</math> is the least over all of the classes. When the priors are equal, this is simply the class with the nearest transformed center.<br />
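<br />
A corresponding sketch for LDA is given below, again on randomly generated data and with variable names chosen only for this example. It assumes that the decomposition is applied to the pooled covariance estimate, and it includes the log prior of each class when comparing distances.<br />
<br />
 X = [randn(50,2); randn(60,2) + 2];            % hypothetical data: rows are observations<br />
 y = [ones(50,1); 2*ones(60,1)];                % hypothetical labels<br />
 [n, d] = size(X);  K = max(y);<br />
 mu = zeros(K, d);  prior = zeros(K, 1);  Sigma = zeros(d);<br />
 for k = 1:K<br />
     Xk = X(y == k, :);<br />
     mu(k,:) = mean(Xk, 1);<br />
     prior(k) = size(Xk, 1) / n;<br />
     Sigma = Sigma + size(Xk, 1) * cov(Xk, 1);<br />
 end<br />
 Sigma = Sigma / (n - K);                       % pooled covariance estimate<br />
 [U, S, V] = svd(Sigma);                        % Step 1: a single decomposition<br />
 W = diag(1 ./ sqrt(diag(S))) * U';             % W = S^(-1/2) * U'<br />
 X_star = X * W';  mu_star = mu * W';           % Step 2: transform the points and the centers<br />
 delta = zeros(n, K);<br />
 for k = 1:K<br />
     Z = X_star - repmat(mu_star(k,:), n, 1);<br />
     delta(:,k) = -0.5 * sum(Z.^2, 2) + log(prior(k));   % Step 3, adjusted by the log prior<br />
 end<br />
 [dmax, class_hat] = max(delta, [], 2);         % predicted class of each data point<br />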
<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In actual data scenarios, QDA will often provide a better classifier than LDA when enough data is available, because QDA does not assume that the covariance matrix for each class is identical, as LDA does. However, QDA still assumes that the class-conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method which does not rely on the Gaussian assumption.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
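<br />
As a quick check of these formulas, the following lines evaluate both counts. The values of <code>K</code> and <code>d</code> are chosen only for illustration.<br />
<br />
 K = 2;  d = [2 10 64];                 % hypothetical number of classes and data dimensions<br />
 n_lda = (K-1) .* (d + 1)               % number of LDA parameters for each d<br />
 n_qda = (K-1) .* (d.*(d+3)/2 + 1)      % number of QDA parameters for each d<br />
<br />
For <math>\,K=2</math> and <math>\,d=64</math>, this gives 65 parameters for LDA and 2145 for QDA.<br />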
<br />
== Trick: Using LDA to do QDA - September 28, 2010==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions of the original features, for example their squares. We then do LDA on our new higher-dimensional data. The boundary found by LDA is linear in the augmented features, and when it is expressed in terms of the original features it becomes a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>. Since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
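We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>, which is linear in <math>\,x^*</math>. This reproduces <math>\,g</math> exactly when <math>\,v</math> is a diagonal matrix with diagonal entries <math>\,v_1,\dots,v_d</math>; a general <math>\,v</math> would also require the cross terms <math>\,x_ix_j</math> as additional features.<br />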
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the element-wise squares of the original matrix, to a cubic dimension with the element-wise cubes, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension. Note, however, that this is not the same as doing QDA. If we apply QDA directly to the same problem, the resulting decision boundary will in general be different. Here we look for a nonlinear boundary that may separate the classes better, but the method differs from the general QDA method. We can call it nonlinear LDA.<br />
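<br />
As a small sketch of these augmentations, the lines below build several extended feature matrices from the same data. The matrix <code>X</code> is a random stand-in with observations in rows, and the names of the augmented matrices are made up for this example.<br />
<br />
 X = randn(5, 2);                          % hypothetical data: 5 observations, 2 features<br />
 X_quad  = [X, X.^2];                      % quadratic augmentation (squares only)<br />
 X_cubic = [X, X.^2, X.^3];                % cubic augmentation<br />
 X_sin   = [X, sin(X)];                    % augmentation with a different function<br />
 X_cross = [X, X.^2, X(:,1).*X(:,2)];      % squares plus the cross term x_1*x_2<br />
<br />
Any of these matrices can then be passed to an LDA routine in place of the original data.<br />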
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
 >> X_star = [sample, sample.^2];   % columns 1-2: the PCA scores; columns 3-4: their element-wise squares<br />
<br />
:This augments our sample with two more dimensions that contain the squares of our initial two-dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA to the same data set and reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the classes of the points 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
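<br />
:As a small sketch, the same error rate can also be computed directly from the two vectors; the variable name <code>err_rate</code> is introduced here only for illustration.<br />
<br />
 >> err_rate = mean(class ~= group);   % fraction of misclassified points, here (400-369)/400<br />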
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA that do not lie on the correct side of the line produced by LDA.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learnt how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which can perform PCA conveniently. The details of this function can be found in the Matlab help file on <code>princomp</code>. Here, however, we will analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORE,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, Rows of <math>\,X</math> correspond to observations, columns to variables. When using princomp on 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
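Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients of the principal components, rather than <math>\,U</math>. This is because princomp expects observations in rows, so the directions in variable space are the right singular vectors.<br />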
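<br />
The relation between the singular values and <code>latent</code> in the listing above can be checked numerically. The following is a minimal sketch, assuming synthetic data rather than the 2_3 data set; the mixing matrix <code>A</code> and the sizes <code>m</code> and <code>n</code> are arbitrary choices made for illustration, and only core functions are used, so <code>princomp</code> itself is not called.<br />
<br />
```matlab
% Synthetic data (assumption): m observations in rows, n variables in columns.
m = 50; n = 3;
A = [2 0 0; 1 1 0; 0 0.5 0.3];                 % arbitrary mixing matrix
x = randn(m, n) * A;

% Same centering and scaling as in the princomp listing.
centerx = x - repmat(mean(x), m, 1);
[U, S, pc] = svd(centerx ./ sqrt(m - 1), 0);   % economy-size SVD
latent = diag(S).^2;                           % squared singular values

% Eigenvalues of the sample covariance matrix, largest first.
C = centerx' * centerx / (m - 1);
C = (C + C') / 2;                              % enforce exact symmetry
ev = sort(eig(C), 'descend');

% Covariance of the scores (the scores already have zero mean).
score = centerx * pc;
Cs = score' * score / (m - 1);

% Both differences should be at the level of rounding error.
disp(max(abs(latent - ev)));
disp(max(max(abs(Cs - diag(latent)))));
```
<br />
If both displayed numbers are close to zero, the squared singular values of the scaled, centered data equal the eigenvalues of the covariance matrix, and the scores are uncorrelated with variances given by <code>latent</code>.<br />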
<br />
The following example performs PCA with the SVD method and with princomp, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (in general, up to the sign of each column).<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Principal Component Analysis - September 30, 2010==<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br />
<br /><br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA takes a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data and projects the data onto these directions, producing a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of our data while capturing as much of the variation in the data as possible. <br />
<br />
<br />
Furthermore, if one considers the lower dimensional representation produced by PCA as a least squares fit of our original data, then it can also be easily shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know which dimensions PCA selects to be the most informative dimensions in order to create the lower-dimensional representation. <br />
<br />
<br />
Suppose <math>\,X</math> is our data matrix containing <math>\,d</math>-dimensional data. The idea behind PCA is to apply [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] to <math>\,X</math> to replace the rows of <math>\,X</math> by a subset of it that captures as much of the [http://en.wikipedia.org/wiki/Variance variance] in <math>\,X</math> as possible. First, through the application of singular value decomposition to <math>\,X</math>, PCA obtains all of our data's directions of variation. These directions would also be ordered from left to right, with the leftmost directions capturing the most amount of variation in our data and the rightmost directions capturing the least amount. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space. <br />
<br />
<br />
By applying singular value decomposition to <math>\,X</math>, <math>\,X</math> is decomposed as <math>\,X = U\Sigma V^T \,</math>. With the data points stored as the columns of <math>\,X</math>, the <math>\,d</math> columns of <math>\,U</math> are the [http://en.wikipedia.org/wiki/Eigenvector eigenvectors] of <math>\,XX^T \,</math>.<br />
The columns of <math>\,V</math> are the eigenvectors of <math>\,X^TX \,</math>. The nonzero diagonal values of <math>\,\Sigma</math> are the square roots of the nonzero [http://en.wikipedia.org/wiki/Eigenvalue eigenvalues] of <math>\,XX^T \,</math> (also of <math>\,X^TX \,</math>), and they correspond to the columns of <math>\,U</math> (also of <math>\,V</math>). <br />
<br />
<br />
We are interested in <math>\,U</math>, whose <math>\,d</math> columns are the <math>\,d</math> directions of variation of our data. Ordered from left to right, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most informative direction of variation of our data. That is, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most effective column in terms of capturing the total variance exhibited by our data. A subset of the columns of <math>\,U</math> is used by PCA to reduce the dimensionality of <math>\,X</math> by projecting <math>\,X</math> onto the columns of this subset. In practice, when we apply PCA to <math>\,X</math> to reduce the dimensionality of <math>\,X</math> from <math>\,d</math> to <math>\,k</math>, where <math>k < d\,</math>, we would proceed as follows:<br />
<br />
:: Step 1: Center <math>\,X</math> so that it would have zero mean.<br />
<br />
:: Step 2: Apply singular value decomposition to <math>\,X</math> to obtain <math>\,U</math>.<br />
<br />
:: Step 3: Suppose we denote the resulting <math>\,k</math>-dimensional representation of <math>\,X</math> by <math>\,Y</math>. Then, <math>\,Y</math> is obtained as <math>\,Y = U_k^TX</math>. Here, <math>\,U_k</math> consists of the first (leftmost) <math>\,k</math> columns of <math>\,U</math> that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.<br />
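<br />
The three steps above can be sketched in Matlab as follows. This is only an illustration under stated assumptions: the data are synthetic, the data points are stored as the columns of <code>X</code>, and the names <code>A</code>, <code>Xc</code>, <code>Uk</code> and <code>retained</code> are introduced here and do not come from the lecture.<br />
<br />
```matlab
% Synthetic data (assumption): d = 3 dimensions, n = 200 points in columns.
d = 3; n = 200; k = 2;
A = [3 0 0; 1 1 0; 0 0.2 0.1];                  % arbitrary mixing matrix
X = A * randn(d, n) + repmat([5; -2; 1], 1, n);

% Step 1: center X so that every row (variable) has zero mean.
Xc = X - repmat(mean(X, 2), 1, n);

% Step 2: SVD; the columns of U are the directions of variation,
% ordered by decreasing singular value.
[U, S, V] = svd(Xc, 'econ');

% Step 3: keep the first k columns of U and project the data onto them.
Uk = U(:, 1:k);
Y  = Uk' * Xc;                                  % k-by-n representation

% Fraction of the total variance captured by the first k directions.
sv2 = diag(S).^2;
retained = sum(sv2(1:k)) / sum(sv2);
disp(size(Y));
disp(retained);
```
<br />
The value of <code>retained</code> gives one rough way to judge whether <math>\,k</math> directions are enough for a given data set.<br />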
<br />
<br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' principal components. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:Handwritten 3s.gif]]<br />
<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,130}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[Image:23plotPCA.jpg]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
{{Cleanup|date=October 2010|reason=I think English of this section must be improved}}<br />
We want to find the direction of maximum variation. Let <math>\begin{align}\textbf{w}\end{align}</math> be an arbitrary direction, <math>\begin{align}\textbf{x}\end{align}</math> a data point and <math>\begin{align}\displaystyle u\end{align}</math> the length of the projection of <math>\begin{align}\textbf{x}\end{align}</math> in direction <math>\begin{align}\textbf{w}\end{align}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\begin{align}\textbf{w}\end{align}</math> is the same as <math>\begin{align}c\textbf{w}\end{align}</math>, for any scalar <math>c > 0</math>, so without loss of generality, we assume that: <br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}.<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}. <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} .</math><br />
<br /><br /><br />
The above is the variance of <math>\begin{align}\displaystyle u \end{align}</math> formed by the weight vector <math>\begin{align}\textbf{w} \end{align}</math>. The first principal component is the vector <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the function. Without a constraint this problem has no solution: replacing <math>\begin{align}\textbf{w} \end{align}</math> by <math>\begin{align}c\textbf{w} \end{align}</math> multiplies the objective <math>\textbf{w}^T s \textbf{w}</math> by <math>c^2</math>, so the objective can be made arbitrarily large. Therefore we need a constraint that restricts the length of <math>\begin{align}\textbf{w} \end{align}</math>. Since we are only interested in the direction of the variability, we fix the length to 1, and the problem becomes<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
s.t. <math>\textbf{w}^T \textbf{w} = 1.</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|.<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value. The values are <math>\displaystyle \sqrt{2}</math> and <math>\displaystyle -\sqrt{2}</math> respectively, so the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
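<br />
The result of this example can be checked numerically. The sketch below is not part of the lecture: it evaluates the two stationary points obtained from the Lagrangian <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> and compares them with a brute-force search over the unit circle, where the grid size is an arbitrary choice.<br />
<br />
```matlab
% Stationarity of L gives x = 1/(2*lambda) and y = -1/(2*lambda);
% substituting into x^2 + y^2 = 1 gives lambda^2 = 1/2.
lambda = [1; -1] / sqrt(2);
x =  1 ./ (2 * lambda);
y = -1 ./ (2 * lambda);
f = x - y;                          % objective at both stationary points
disp([x, y, f]);                    % one row per stationary point

% Brute-force check: parametrize the circle as (cos t, sin t).
t = linspace(0, 2*pi, 100001);
[fmax, idx] = max(cos(t) - sin(t));
disp([cos(t(idx)), sin(t(idx)), fmax]);
```
<br />
The first row of the first display and the brute-force result should both be approximately (0.7071, -0.7071, 1.4142), which agrees with the maximum found above.<br />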
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} - 1 = 0 </math>, so the second term of the Lagrangian vanishes and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
Setting the derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian is a unit vector.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
{{Cleanup|date=October 2010|reason=It is good discussion, what will happen if we don't have distinct eigenvalues and eigenvectors? What does this situation mean? }}<br />
{{Cleanup|date=October 2010|reason=If the eigenvalues are not distinct, I suppose we could still take the leftmost eigenvector by default. Not sure if this is the correct approach, so can anyone please explain further? Thanks }}<br />
{{Cleanup|date=October 2010|reason= As U is the eigenvector of a symetric matrix, is it possible that we have 2 similar eigen vector? }}<br />
<br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function, we choose the eigenvector corresponding to the largest eigenvalue. This gives the first PC, '''u<sub>1</sub>''', which has the maximum variance (i.e. it captures as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
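<br />
These facts can be illustrated numerically. The following is a minimal sketch on synthetic data, assuming data points stored as columns; the matrix <code>A</code> and the variable names are chosen here for illustration only. It checks that the variance along each eigenvector equals the corresponding eigenvalue, that the eigenvalues sum to the trace of <math>\,S</math>, and that no randomly drawn unit vector gives a larger variance than the first principal component.<br />
<br />
```matlab
% Synthetic data (assumption): D = 3 dimensions, n = 400 points in columns.
D = 3; n = 400;
A = [2 0 0; 1 1 0; 0.5 0.2 0.3];        % arbitrary mixing matrix
x = A * randn(D, n);
xc = x - repmat(mean(x, 2), 1, n);
S = xc * xc' / (n - 1);                 % sample covariance matrix
S = (S + S') / 2;                       % enforce exact symmetry

% Eigenvectors of S, reordered so that the largest eigenvalue comes first.
[W, L] = eig(S);
[lambda, order] = sort(diag(L), 'descend');
W = W(:, order);
w1 = W(:, 1);                           % first principal component

% Var(u_i) equals lambda_i, where u_i is the projection on direction i.
u = W' * xc;
disp([var(u, 0, 2), lambda]);

% The principal components decompose the total variance.
disp([sum(lambda), trace(S), sum(var(x, 0, 2))]);

% Variance along 1000 random unit vectors, compared with lambda(1).
R = randn(D, 1000);
R = R ./ repmat(sqrt(sum(R.^2)), D, 1);
best_random = max(sum(R .* (S * R)));
disp([best_random, lambda(1)]);
```
<br />
In the last display the first number should not exceed the second, which is consistent with <code>w1</code> being the maximizer of <math>\textbf{w}^T S \textbf{w}</math> over unit vectors.<br />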
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
m_X=mean(X,2);<br />
mm=repmat(m_X,1,300);<br />
XX=X-mm;<br />
[u s v] = svd(XX);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
xHat=xHat+mm;<br />
figure<br />
imagesc(reshape(xHat(:,1),20,28)') % the same image as above after de-noising; '1' can be changed to any column index from 1 to 300<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Extraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized as follows (taken from the Lecture Slides).<br />
<br />
====Algorithm ====<br />
'''Recover basis:''' Calculate <math> XX^T =\sum_{i=1}^{n} x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.<br />
<br />
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times n</math> matrix of encoding of the original data.<br />
<br />
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.<br />
<br />
'''Encode test example:''' <math> y=U^T x </math> where <math> y </math> is a <math>d</math>-dimensional encoding of <math>x</math>.<br />
<br />
'''Reconstruct test example:''' <math>\hat{x}= Uy=UU^Tx </math>.<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.<br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
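<br />
The algorithm and notes above can be sketched as follows. This is an illustration under stated assumptions and not the code from the lecture slides: the training data are synthetic, with a hypothetical 2-dimensional structure <code>B</code> plus a small amount of noise, and the mean is subtracted before encoding and added back after reconstruction, as required by the first note.<br />
<br />
```matlab
% Synthetic training data (assumption): D = 5, n = 100 points in columns.
D = 5; n = 100; d = 2;
B = randn(D, d);                            % hypothetical low-dimensional structure
X = B * randn(d, n) + 0.05 * randn(D, n);
mu = mean(X, 2);
Xc = X - repmat(mu, 1, n);                  % zero-mean data

% Recover basis: eigenvectors of XX^T for the top d eigenvalues.
C = Xc * Xc';
C = (C + C') / 2;                           % enforce exact symmetry
[E, L] = eig(C);
[vals, order] = sort(diag(L), 'descend');
U = E(:, order(1:d));

% Encode and reconstruct the training data.
Y = U' * Xc;                                % d-by-n encoding
Xhat = U * Y + repmat(mu, 1, n);

% Encode and reconstruct a test example drawn in the same way.
x = B * randn(d, 1) + 0.05 * randn(D, 1);
y = U' * (x - mu);
xhat = U * y + mu;

% Relative reconstruction errors for the training set and the test example.
train_err = norm(X - Xhat, 'fro') / norm(X, 'fro');
test_err  = norm(x - xhat) / norm(x);
disp([train_err, test_err]);
```
<br />
Because the synthetic data lie close to a 2-dimensional subspace, the training error should be small here; for real data the error depends on how much variance the discarded directions carry.<br />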
<br />
== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010 ==<br />
<br />
===Sir Ronald A. Fisher===<br />
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here]. <br />
In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
As in PCA, the goal of FDA is to project the data onto a lower-dimensional space. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf here]. PCA is an unsupervised method, so it does not take the labels of the data into account. Suppose we have two clusters with different labels that are elongated in the same direction, parallel to each other and very near each other. In this case, most of the total variation of the data is along the common direction of these two clusters. If we use PCA in cases like this, both clusters are projected onto the direction of greatest variation and overlap after projection, looking like a single cluster. PCA would therefore mix up two clusters that in fact have different labels. What we need to do instead, in cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation, which is the direction of least variation of the data. In the 1-dimensional space resulting from such a projection, we would then be able to classify the data effectively, because the two clusters would be perfectly or nearly perfectly separated from each other according to their labels. This is exactly the idea behind FDA. <br />
<br />
The main difference between FDA and PCA is that, in FDA, in contrast to PCA, we are not interested in retaining as much of the variance of our original data as possible. Rather, in FDA, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction that is most representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
Suppose we have 2-dimensional data, then FDA would attempt to project the data of each class onto a point in such a way that the resulting two points would be as far apart from each other as possible. Intuitively, this basic idea behind FDA is the optimal way for separating each pair of classes along a certain direction. <br />
<br />
{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).}}<br />
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}<br />
{{Cleanup|date=October 2010|reason=An extension of clustering is subspace clustering in which different subspace are searched through to find the relavant and appropriate dimentions. High dimentional data sets are roughly equiedistant from each other, so feature selection methods are used to remove the irrelavant dimentions. These techniques do not keep the relative distance so PCA is not useful for these applications. It should be noted that subspace clustering localize their search unlike feature selection algorithms.for more information click here[http://portal.acm.org/citation.cfm?id=1007731]}}<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2-class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k-class problem, we want to reduce the data to k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
{{Cleanup|date=October2010|reason=Anyone please add an example to make the comparison clearer}}<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs as well as or better than some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
===FDA Goals===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
==== Example in R ====<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
 >> library(MASS)  # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
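 >> s <- svd(scale(X,scale=FALSE),nu=1,nv=1)<br />
: Calculate the singular value decomposition of the column-centered X (centering is needed for the leading right singular vector to coincide with the first principal component). The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />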
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
<br />
FDA projects the data into a lower-dimensional space in which the distance between the projected class means is maximized and the within-class variances are minimized. There are two categories of classification problems:<br />
<br />
1. Two-class problem<br />
<br />
2. Multi-class problem (addressed next lecture)<br />
<br />
=== Two-class problem ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, with each class possibly having a different size. To separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by its covariance.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within each class''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances (the sum of the two covariance matrices is itself a valid covariance matrix, since it is symmetric and positive semi-definite). <br />
<br />
{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}<br />
<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline{x_1}, \underline{x_2}, \cdots, \underline{x_n} \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> where <math>\ z_i </math> is a scalar<br />
<br />
====1. Minimizing within-class variance==== <br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_1\underline{w}) </math><br />
<br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_2\underline{w}) </math><br />
<br />
and we combine these two problems into <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math><br />
<br> (where <math>\,\Sigma_1</math> and <math>\,\Sigma_2 </math> are the covariance matrices of the 1st and 2nd classes of data, respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>.<br />
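<br />
The claim that the variance of the projected points of a class equals the quadratic form <math>\underline{w}^T\Sigma\underline{w}</math> can be checked numerically. The following is a minimal sketch in plain Python; the four sample points and the direction <code>w</code> are made-up values chosen only for illustration, and the covariance is assumed to use the <math>1/n</math> normalization.<br />

```python
# Illustrative check: the variance of the projected points z_i = w^T x_i
# equals the quadratic form w^T Sigma w. The data and w are made-up values.
points = [[0.0, 0.0], [2.0, 1.0], [1.0, 3.0], [3.0, 2.0]]
w = [0.6, 0.8]
n = len(points)

# Sample mean and covariance (1/n normalization) of the original 2-D points.
mean = [sum(p[a] for p in points) / n for a in range(2)]
sigma = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
          for b in range(2)] for a in range(2)]

# Project every point onto w and compute the variance of the resulting scalars.
z = [w[0] * p[0] + w[1] * p[1] for p in points]
z_mean = sum(z) / n
projected_variance = sum((zi - z_mean) ** 2 for zi in z) / n

# Quadratic form w^T Sigma w.
quadratic_form = sum(w[a] * sigma[a][b] * w[b]
                     for a in range(2) for b in range(2))

print(projected_variance, quadratic_form)
```

The two printed numbers agree up to floating-point rounding, which is why minimizing the spread of the projected class can be written directly in terms of the covariance matrix.<br />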
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br /> <br />
<math>\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2, </math><br />
<br /><br /><br />
<math>\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1 - \mu_2)^T \underline{w} \underline{w}^T (\mu_1 - \mu_2) \\<br />
&= ((\mu_1 - \mu_2)^T \underline{w})^{T} (\underline{w}^T (\mu_1 - \mu_2))^{T} \\<br />
&= \underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \underline{w} \end{align}</math><br /><br />
<br />
Note that the last two lines use the fact that each factor is a scalar: a scalar equals its own transpose, and scalar factors can be reordered freely.<br />
<br />
Let <math>\displaystyle s_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math>, the between-class covariance, then the goal is to <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>.<br />
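<br />
As a sanity check of the identity <math>||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 = \underline{w}^Ts_B\underline{w}</math>, the sketch below evaluates both sides in plain Python. The means are the ones used in the examples of this section; the direction <code>w</code> is an arbitrary value picked for illustration.<br />

```python
# Illustrative check that the squared distance between the projected means
# equals the quadratic form w^T s_B w, with s_B = (mu1 - mu2)(mu1 - mu2)^T.
mu1 = [1.0, 1.0]
mu2 = [5.0, 3.0]
w = [0.6, 0.8]  # arbitrary direction, chosen only for illustration


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


diff = [a - b for a, b in zip(mu1, mu2)]
s_B = [[di * dj for dj in diff] for di in diff]  # outer product diff diff^T

squared_distance = (dot(w, mu1) - dot(w, mu2)) ** 2
quadratic_form = dot(w, [dot(row, w) for row in s_B])

print(squared_distance, quadratic_form)
```

Both quantities are the same number, so maximizing the separation of the projected means is the same as maximizing <math>\underline{w}^Ts_B\underline{w}</math>.<br />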
<br />
===The Objective Function for FDA===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math> or <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math><br />
# <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math> <br />
<br /><br /><br />
So, we construct our objective function as maximizing the ratio of the two quantities above:<br /><br />
<br /><br />
<math>\displaystyle \max_w \frac{(\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w})} {(\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max_w \frac{(\underline{w}^Ts_B\underline{w})}{(\underline{w}^Ts_w\underline{w})}</math> <br /><br />
One could instead maximize the difference of the two terms; this is possible, but it requires an additional scaling factor to balance them, so the ratio is more convenient.<br />
<br />
The ratio is unchanged when <math>\underline{w}</math> is rescaled (replacing <math>\underline{w}</math> by <math>c\underline{w}</math> for any <math>c \neq 0</math> leaves it the same), so the maximizer is not unique. To get around this problem, we fix the scale with the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> and maximize the numerator. To solve this optimization problem we form the Lagrangian:<br />
<br />
<br /><br /><br />
<math>\displaystyle L(\underline{w},\lambda) = \underline{w}^Ts_B\underline{w} - \lambda (\underline{w}^Ts_w\underline{w} -1)</math><br /><br /><br />
<br />
<br /><br />
Then, we set the partial derivative of L with respect to <math>\underline{w}</math> to zero:<br />
<math>\displaystyle \frac{\partial L}{\partial \underline{w}}=2s_B \underline{w} - 2\lambda s_w \underline{w} = 0 </math> <br /><br />
<br />
<math>s_B \underline{w} = \lambda s_w \underline{w}</math><br /><br />
<math>s_w^{-1}s_B \underline{w}= \lambda\underline{w}</math><br /><br /><br />
This is a generalized eigenvalue problem. At a stationary point the objective equals <math>\lambda</math>, so <math> \underline{w}</math> is the eigenvector of <math>s_w^{-1}s_B </math> corresponding to the largest eigenvalue.<br /><br />
<br />
This solution can be further simplified as follows:<br /><br />
<br />
<math>s_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w} = \lambda\underline{w} </math><br /><br />
<br />
Since <math>(\mu_1 - \mu_2)^T\underline{w}</math> is a scalar, <math>s_w^{-1}(\mu_1 - \mu_2) \propto \underline{w}</math>. <br /><br /><br />
This gives the direction of <math>\underline{w}</math> without doing eigenvalue decomposition in the case of 2-class problem.<br />
<br />
Note: In order for <math>{s_w}</math> to have an inverse, it must have full rank. A necessary (but not sufficient) condition is that the number of data points is <math>\,\ge</math> the dimensionality of <math>\underline{x_{i}}</math>.<br />
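<br />
The closed-form direction <math>s_w^{-1}(\mu_1 - \mu_2)</math> can be illustrated with a small sketch in plain Python. It assumes the population parameters used in the examples of this section (means <math>(1,1)</math> and <math>(5,3)</math>, common covariance) rather than estimates from data; the helper names <code>inverse_2x2</code> and <code>fisher_ratio</code> are hypothetical and introduced only for this illustration.<br />

```python
# Two-class FDA direction in closed form, using the parameters of the
# examples in this section (mu1 = (1,1), mu2 = (5,3), common covariance).
mu1 = [1.0, 1.0]
mu2 = [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]

# Within-class matrix s_w = Sigma_1 + Sigma_2 (both classes share sigma here).
s_w = [[2.0 * sigma[a][b] for b in range(2)] for a in range(2)]
diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("s_w is singular, so the direction is not defined")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def fisher_ratio(w):
    """Objective (w^T s_B w) / (w^T s_w w) with s_B = diff diff^T."""
    between = (w[0] * diff[0] + w[1] * diff[1]) ** 2
    within = sum(w[a] * s_w[a][b] * w[b] for a in range(2) for b in range(2))
    return between / within


s_w_inv = inverse_2x2(s_w)
w_fda = [s_w_inv[a][0] * diff[0] + s_w_inv[a][1] * diff[1] for a in range(2)]

print(w_fda)                                   # proportional to s_w^{-1}(mu1 - mu2)
print(fisher_ratio(w_fda))                     # objective at the FDA direction
print(fisher_ratio([3.0 * x for x in w_fda]))  # rescaling w does not change it
print(fisher_ratio([1.0, 0.0]))                # another direction scores lower
```

The sign of the direction is immaterial: using <math>\mu_2 - \mu_1</math> instead, as the Matlab example below does, gives the same line.<br />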
<br />
===FDA Using Matlab===<br />
Note: ''The following example was not actually mentioned in this lecture''<br />
<br />
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data set:<br />
% First data set X1<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);<br />
%In this case: <br />
mu_1=[1;1]; <br />
Sigma_1=[1 1.5; 1.5 3]; <br />
%where mu and sigma are the mean and covariance matrix.<br />
% Second data set X2<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300); <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
plot(X1(:,1),X1(:,2),'.b'); hold on;<br />
plot(X2(:,1),X2(:,2),'ob')<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
% Combine data sets to map both into the same subspace<br />
X=[X1;X2];<br />
X=X';<br />
 % We use the built-in PCA function in Matlab; princomp expects one observation per row, so we pass X'<br />
 [coefs, scores]=princomp(X');<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is very little overlap<br />
Xp=coefs(:,1)'*X<br />
figure<br />
 plot(Xp(1:300),1,'ob')<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
===Some of FDA applications===<br />
FDA has applications in many domains; some of them are listed below:<br />
<br />
* Speech/music/noise classification in hearing aids<br />
FDA can be used to enhance listening comprehension when the user moves from one sound<br />
environment to a different one. For more information, see the paper by Alexandre et al. [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here].<br />
<br />
* Application to face recognition<br />
FDA can be used in face recognition in different situations. Kong et al. propose a 2D FDA framework for face<br />
recognition with a small number of training samples [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].<br />
<br />
* Palmprint Recognition<br />
FDA is used in biometrics, to implement an automated palmprint recognition system. See An Automated Palmprint Recognition System by Tee et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here].<br />
<br />
{{Cleanup|date=October 2010|reason=I think briefing about the other applications would be easier than browsing through all of these applications}}<br />
<br />
{{Cleanup|date=October 2010|reason= This link is no longer valid.}}<br />
<br />
Other applications can be found in references 4-8, and more [http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=1489148820&_sort=r&_st=13&view=c&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=f210273546a659c90ae0962fce7b8b4e&searchtype=a here].<br />
<br />
=== '''References'''===<br />
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; , "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , vol.2, no., pp. 1083- 1088 vol. 2, 20-25 June 2005<br />
doi: 10.1109/CVPR.2005.30<br />
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]<br />
<br />
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS USING A TWO-LAYER CLASSIFICATION SYSTEM WITH MSE LINEAR DISCRIMINANTS", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]<br />
<br />
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]<br />
<br />
4. Guimet, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.<br />
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]<br />
<br />
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"<br />
Journal of Computers & Chemical Engineering, 2004<br />
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]<br />
<br />
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004<br />
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]<br />
<br />
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]<br />
<br />
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal of Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]<br />
<br />
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010==<br />
<br />
====Obtaining Covariance Matrices====<br />
<br />
<br />
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between-class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total covariance <math>\,\mathbf{S}_{T}</math> of a given set of data is constant and known, and we can also decompose this variance into two parts: the within-class variance <math>\mathbf{S}_{W}</math> and the between-class variance <math>\mathbf{S}_{B}</math> in a way that is similar to [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA]. We thus have:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
where the total variance is given by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = <br />
\frac{1}{n}<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
We can now get <math>\mathbf{S}_{B}</math> from the relationship: <br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
<br />
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Suppose the data contains <math>\, k </math> classes, and each class <math>\, j </math> contains <math>\, n_{j} </math> data points. We denote the overall mean vector by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\frac{1}{n} \sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
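Ignoring the normalizing factors <math>\frac{1}{n}</math> and <math>\frac{1}{n_{i}}</math> (that is, working with unnormalized scatter matrices) and noting that the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = 0</math>, we obtain<br />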
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within-class covariance <math>\mathbf{S}_{W}</math><br />
and the between-class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term in the final line of the derivation above as the between-class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
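<br />
The decomposition of the total scatter into within-class and between-class parts can be verified numerically. The sketch below uses plain Python and unnormalized scatter matrices (no <math>1/n</math> factors), which is the form for which the identity holds exactly; the three small classes are made-up data and the helper names are hypothetical.<br />

```python
# Illustrative check of S_T = S_W + S_B using unnormalized scatter matrices.
# The three small classes below are made-up data.
classes = {
    1: [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    2: [[4.0, 4.0], [5.0, 6.0]],
    3: [[8.0, 1.0], [9.0, 0.0], [10.0, 2.0], [9.0, 1.0]],
}


def mean(points):
    """Component-wise mean of a list of 2-D points."""
    return [sum(p[a] for p in points) / len(points) for a in range(2)]


def scatter(points, center):
    """Sum of outer products (x - center)(x - center)^T over the points."""
    return [[sum((p[a] - center[a]) * (p[b] - center[b]) for p in points)
             for b in range(2)] for a in range(2)]


def add(m1, m2):
    """Element-wise sum of two 2x2 matrices (returns a new matrix)."""
    return [[m1[a][b] + m2[a][b] for b in range(2)] for a in range(2)]


all_points = [p for pts in classes.values() for p in pts]
mu = mean(all_points)

zero = [[0.0, 0.0], [0.0, 0.0]]
S_W, S_B = zero, zero
for pts in classes.values():
    mu_i = mean(pts)
    S_W = add(S_W, scatter(pts, mu_i))
    # n_i (mu_i - mu)(mu_i - mu)^T, written as the scatter of n_i copies of mu_i
    S_B = add(S_B, scatter([mu_i] * len(pts), mu))

S_T = scatter(all_points, mu)
print(S_T)
print(add(S_W, S_B))
```

The two printed matrices coincide up to rounding, so <math>\mathbf{S}_{B}</math> can be obtained either directly from the class means or as <math>\mathbf{S}_{T} - \mathbf{S}_{W}</math>.<br />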
<br />
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^* =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})^{T})<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
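The second expression is the general definition with <math>k=2</math>. The two are closely related: since <math>\mathbf{\mu} = \frac{1}{n}(n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2})</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, so <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B}^*</math>. The two definitions differ only by a scalar factor and therefore lead to the same direction.<br />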
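<br />
For two classes, the general between-class matrix is a scalar multiple of the two-class one, <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B}^*</math>. A minimal sketch in plain Python checks this relation; the class sizes are made-up values and the means are those used in the earlier examples.<br />

```python
# Illustrative check (made-up class sizes) that for k = 2 the general
# between-class matrix is (n1 * n2 / n) times (mu1 - mu2)(mu1 - mu2)^T.
n1, n2 = 3, 5
mu1 = [1.0, 1.0]
mu2 = [5.0, 3.0]
n = n1 + n2
mu = [(n1 * mu1[a] + n2 * mu2[a]) / n for a in range(2)]  # overall mean


def outer(u, v):
    """Outer product u v^T of two vectors given as lists."""
    return [[ui * vj for vj in v] for ui in u]


d1 = [mu1[a] - mu[a] for a in range(2)]
d2 = [mu2[a] - mu[a] for a in range(2)]
d12 = [mu1[a] - mu2[a] for a in range(2)]

o1, o2 = outer(d1, d1), outer(d2, d2)
S_B = [[n1 * o1[a][b] + n2 * o2[a][b] for b in range(2)] for a in range(2)]
S_B_star = outer(d12, d12)
scaled = [[n1 * n2 / n * S_B_star[a][b] for b in range(2)] for a in range(2)]

print(S_B)
print(scaled)
```

Because the two matrices are proportional, both definitions give the same optimal direction in the two-class case.<br />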
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))((\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W})<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the following as our measure:<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of the eigenvalue problem<br />
<br />
{{Cleanup|date=What if we encounter complex eigenvalues? Then the concept of being large does not make sense. What is the solution in that case? }}<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
<br />
Recall that the squared Frobenius norm of <math>\mathbf{X}</math> is <br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
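<br />
The trace identity derived above can be checked numerically. The following sketch, in plain Python, compares <math>\sum_{i}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}</math> with <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math>; the class sizes, class means and the matrix <code>W</code> are made-up values used only for illustration.<br />

```python
# Illustrative check (made-up class means, sizes and W) that
# sum_i n_i ||W^T (mu_i - mu)||^2 equals Tr[W^T S_B W].
sizes = [3, 2, 4]
means = [[0.0, 1.0, 2.0], [4.0, 0.0, 1.0], [2.0, 5.0, 3.0]]
W = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]  # d x (k-1) = 3 x 2
d, q = 3, 2
n = sum(sizes)
mu = [sum(n_i * m[a] for n_i, m in zip(sizes, means)) / n for a in range(d)]


def project(v):
    """Return W^T v, a vector of length k-1."""
    return [sum(W[a][c] * v[a] for a in range(d)) for c in range(q)]


# Left-hand side: weighted squared norms of the projected, centered means.
lhs = 0.0
for n_i, m in zip(sizes, means):
    z = project([m[a] - mu[a] for a in range(d)])
    lhs += n_i * sum(zc ** 2 for zc in z)

# Right-hand side: build S_B, then take the trace of W^T S_B W.
S_B = [[sum(n_i * (m[a] - mu[a]) * (m[b] - mu[b])
            for n_i, m in zip(sizes, means))
        for b in range(d)] for a in range(d)]
rhs = sum(W[a][c] * S_B[a][b] * W[b][c]
          for c in range(q) for a in range(d) for b in range(d))

print(lhs, rhs)
```

The two numbers agree up to rounding, which is what allows the separation of the projected class means to be written as a trace.<br />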
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following classic criterion function that Fisher used<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem we introduce Lagrange multipliers. Fixing the scale of each column of <math>\mathbf{W}</math> separately (<math>\mathbf{w}_{i}^{T}\mathbf{S}_{W}\mathbf{w}_{i}=1</math>, the column-wise analogue of the constraint above) gives one multiplier <math>\lambda_{i}</math> per column, and we collect these in a <math>(k-1) \times (k-1)</math> diagonal matrix <math>\Lambda</math>:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left(\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - \mathbf{I}\right)\right]<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, so setting the first derivative to zero gives:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \mathbf{S}_{W}\mathbf{W}\Lambda=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \mathbf{S}_{W}\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{B}) \le k-1</math> (the <math>k</math> vectors <math>\mathbf{\mu}_{i}-\mathbf{\mu}</math> satisfy <math>\sum_{i}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})=0</math>).<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of the eigenvalue problem<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
{{Cleanup|date=October 2010|reason=Adding more general comments about the advantages and flaws of FDA would be effective here.}}<br />
<br />
{{Cleanup|date=October 2010|reason=Would you please show how we could reconstruct our original data from the data whose dimensionality has been reduced by FDA.}}<br />
{{Cleanup|date=October 2010|reason= When you reduce the dimensionality of data, in the most general case you lose some features of the data and cannot reconstruct the data from the reduced space unless the data have special properties, such as sparsity, that help in reconstruction. In FDA it seems that in general we cannot reconstruct the data from their reduced version. }}<br />
<br />
===Generalization of Fisher's Linear Discriminant Analysis ===<br />
<br />
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity<br />
and the fact that it does not require strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be badly affected by even a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Therefore, there is a need for robust procedures that can accommodate outliers and are not strongly affected by them. A generalization of Fisher's linear discriminant algorithm [http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf] has been developed that leads easily to a very robust procedure.<br />
<br />
Also notice that LDA can be seen as a dimensionality reduction technique. In a general k-class problem, we have k class means, which lie in an affine subspace of dimension at most k-1. Given a data point, we are looking for the class mean closest to this point. In LDA, we project the data point onto this subspace and calculate distances within it. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimensionality, from d dimensions to k - 1 dimensions.<br />
<br />
==Linear and Logistic Regression - October 12, 2010==<br />
<br />
===Linear Regression===<br />
Linear regression is an approach for modeling a scalar dependent variable <math>\, y</math> from a set of independent (explanatory) variables <math>\,X</math>. In linear regression the goal is to find an appropriate set of explanatory variables for <math>\, y</math> and to estimate its value from them. In classification, by contrast, the goal is to assign data to groups such that members of the same group are more similar to one another than to members of other groups.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
According to the Bayes classifier, we estimate the posterior as<br/><br />
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{l}f_{l}(x)\pi_{l}}</math><br/><br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The simple linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
y_i = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
and we can denote it as<br />
:<math><br />
\begin{align}<br />
\mathbf{y} = \beta^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
where <math>\,\beta^{T} = (<br />
\beta_1,..., \beta_{d},\beta_0)</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=<br />
\begin{pmatrix}<br />
\mathbf{x}_{1}, \dots,\mathbf{x}_{n}\\<br />
1, \dots, 1<br />
\end{pmatrix}<br />
</math> is a <math>(d+1) \times n</math> matrix, and each <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta^{T}</math> such that the linear model fits the data while minimizing the sum of squared errors using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin \mathbf{x}_{i}</math>,<br />
or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
We then try to minimize the residual sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\beta^{T}\mathbf{X})(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}(\mathbf{y}^{T}-\mathbf{X}^{T}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}^{T}<br />
\end{align}<br />
</math><br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \hat\beta^{T}\mathbf{X} = <br />
\mathbf{y}\mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X} </math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].<br />
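<br />
The closed-form solution above can be checked numerically. The following is a minimal Python sketch, not part of the lecture: the toy data and the helper names <code>solve</code> and <code>least_squares</code> are assumptions made for illustration. It keeps the convention used here, with <math>\mathbf{X}</math> stored as a <math>(d+1) \times n</math> matrix whose last row is all ones.<br />
```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def least_squares(X, y):
    """Return beta_hat = (X X^T)^{-1} X y^T for a (d+1) x n matrix X."""
    rows, n = len(X), len(y)
    XXt = [[sum(X[a][i] * X[b][i] for i in range(n)) for b in range(rows)]
           for a in range(rows)]
    Xy = [sum(X[a][i] * y[i] for i in range(n)) for a in range(rows)]
    return solve(XXt, Xy)


# Hypothetical toy data generated exactly from y = 2x + 1 (d = 1, n = 4).
xs = [0.0, 1.0, 2.0, 3.0]
X = [xs, [1.0] * len(xs)]          # last row of ones carries the intercept
y = [2.0 * x + 1.0 for x in xs]

beta_hat = least_squares(X, y)     # (slope, intercept)
y_hat = [beta_hat[0] * x + beta_hat[1] for x in xs]  # fitted values
print(beta_hat, y_hat)
```
Because the toy responses lie exactly on a line, the fitted values should reproduce them up to rounding error.<br />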
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{l}f_{l}(x)\pi_{l}}</math><br/><br />
To make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If it is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then nothing restricts <math>\displaystyle r(x)</math> to values between 0 and 1. On the other hand, this is a more direct approach to classification, since it does not need to estimate <math>\ f_k(x) </math> and <math>\ \pi_k </math>. For a 0/1 response, the regression function equals the posterior probability of class 1:<br />
<math>\ 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x)=E(Y|X=x) </math>. <br />
This model does not constrain its estimate of the posterior to lie between 0 and 1, so it is not a good model of the posterior, but at times it can lead to a decent classifier. <math>\ y_i=\frac{1}{n_1} </math> <math>\ \frac{-1}{n_2} </math><br />
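<br />
Both halves of this note can be seen on a small example. The sketch below is an illustration under assumed toy data (one feature, six points, 0/1 labels); it uses the single-feature special case of the least squares solution and then thresholds the fitted value at 0.5.<br />
```python
# Hypothetical one-dimensional training set with 0/1 labels.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares slope and intercept for a single feature.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def r(x):
    """Linear regression estimate of P(Y=1 | X=x); not restricted to [0, 1]."""
    return slope * x + intercept


fitted = [r(x) for x in xs]
predicted = [1 if value >= 0.5 else 0 for value in fitted]
print(fitted)     # the first value is below 0 and the last is above 1
print(predicted)  # thresholding at 0.5 still recovers the labels here
```
On this data the fitted "probabilities" leave the interval [0,1] at both ends, yet the thresholded rule classifies every training point correctly.<br />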
<br />
===Logistic Regression===<br />
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood <math>\displaystyle Pr(Y|X)</math>. Since <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications with two classes, for instance: people survive or die, have a disease or not, have a risk factor or not.<br />
<br />
====Logistic Function====<br />
[[File:200px-Logistic-curve.svg.png | Logistic Sigmoid Function]]<br />
<br />
<br />
<br />
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common sigmoid curve. <br />
<br />
1. <math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
3. <math>y(0) = \frac{1}{2}</math><br />
<br />
4. <math> \int y \, dx = \ln(1 + e^{x}) + C</math><br />
<br />
5. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
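<br />
The properties listed above are easy to verify numerically. The following Python sketch is only an illustration; the function names and the test point are assumptions. The fifth-order coefficient used in <code>taylor5</code> is 1/480, the value given by the standard Taylor expansion of the logistic function.<br />
```python
import math


def sigmoid(x):
    """Logistic function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """An antiderivative of the logistic function: ln(1 + e^x)."""
    return math.log1p(math.exp(x))


def taylor5(x):
    """Taylor polynomial of the logistic function about 0, up to x^5."""
    return 0.5 + x / 4 - x ** 3 / 48 + x ** 5 / 480


x, h = 0.3, 1e-6

# Property 2: dy/dx = y (1 - y), compared with a central difference.
numeric_derivative = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic_derivative = sigmoid(x) * (1 - sigmoid(x))

# Property 4: the derivative of ln(1 + e^x) is the logistic function.
softplus_slope = (softplus(x + h) - softplus(x - h)) / (2 * h)

print(numeric_derivative, analytic_derivative)
print(softplus_slope, sigmoid(x))
print(sigmoid(x), taylor5(x))
```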
<br />
====Intuition behind Logistic Regression====<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
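<br />
As a short worked step (writing <math>\,p=P(Y=1|X=x)</math>, so that <math>\,P(Y=0|X=x)=1-p</math>), solving the log odds equation for <math>\,p</math> gives:<br />
:<math>\log\left(\frac{p}{1-p}\right)=\beta^Tx \;\Rightarrow\; \frac{p}{1-p}=\exp(\beta^Tx) \;\Rightarrow\; p=\exp(\beta^Tx)(1-p) \;\Rightarrow\; p=\frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}</math><br />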
<br />
====The Logistic Regression Model====<br />
<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
{{Cleanup|date=October 18 2010|reason=I Could not find any source for these graphs. However, they following the definition of the defined probability. I don't think the generated graph as it is here is copyrighted, but if you worried you can draw this figure by applying the function and post the result.}}<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
====Fitting a Logistic Regression====<br />
Logistic regression tries to fit a distribution. The common practice in statistics is to fit a density function, here the posterior probability of each class, <math>\displaystyle Pr(Y|X)</math>, to the data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the observed labels for the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence and identical distribution)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n \left[ y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[ y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[ y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\right]\\<br />
&=\displaystyle\sum_{i=1}^n \left[ y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\right]<br />
\end{align}<br />
</math><br />
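<br />
The simplification above can be checked numerically. The sketch below is illustrative only: the toy data, the parameter values and the function names are assumptions. It evaluates the log-likelihood both in its first form and in its final simplified form, where <math>\log(1+\exp(a))</math> is computed in a way that avoids overflow for large <math>a</math>.<br />
```python
import math


def softplus(a):
    """log(1 + exp(a)), written so that exp never receives a large argument."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def log_likelihood(beta, xs, ys):
    """Simplified form: sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        a = sum(b * xj for b, xj in zip(beta, x))
        total += y * a - softplus(a)
    return total


def log_likelihood_direct(beta, xs, ys):
    """First form: sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        a = sum(b * xj for b, xj in zip(beta, x))
        p = math.exp(a) / (1.0 + math.exp(a))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical data: each input is (feature, 1.0), the 1.0 carrying the intercept.
xs = [(0.5, 1.0), (-1.2, 1.0), (2.0, 1.0), (0.1, 1.0)]
ys = [1, 0, 1, 0]
beta = (0.8, -0.3)

print(log_likelihood(beta, xs, ys), log_likelihood_direct(beta, xs, ys))
```
At <math>\underline{\beta}=0</math> every point has probability 1/2, so the log-likelihood reduces to <math>-n\log 2</math>, which gives a quick sanity check.<br />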
<br />
<br />
{{Cleanup|date=October 13 2010|reason=I think, in the following, y_i * x_i and the single x_i on the right side should both be transposed by matrix calculus?}}<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math> <br />
<br />
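These are <math>\,d+1</math> nonlinear equations in <math> \underline{\beta} </math>. Because every <math>\underline{x}_i</math> includes a constant component equal to 1 (for the intercept), the corresponding equation reads <math>\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta}) </math>, i.e. the expected number of class ones matches the observed number.<br />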
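<br />
A common way to gain confidence in a derivative like the one above is to compare it with finite differences. The following sketch does this on assumed toy data; the function names are hypothetical and the step size <code>h</code> is an arbitrary choice.<br />
```python
import math


def prob(beta, x):
    """p(x; beta) = exp(beta^T x) / (1 + exp(beta^T x))."""
    a = sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-a))


def log_likelihood(beta, xs, ys):
    return sum(y * math.log(prob(beta, x)) + (1 - y) * math.log(1.0 - prob(beta, x))
               for x, y in zip(xs, ys))


def gradient(beta, xs, ys):
    """Analytic gradient: sum_i (y_i - p(x_i; beta)) x_i."""
    g = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        residual = y - prob(beta, x)
        for j, xj in enumerate(x):
            g[j] += residual * xj
    return g


def numeric_gradient(beta, xs, ys, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    g = []
    for j in range(len(beta)):
        up, down = list(beta), list(beta)
        up[j] += h
        down[j] -= h
        g.append((log_likelihood(up, xs, ys) - log_likelihood(down, xs, ys)) / (2 * h))
    return g


# Hypothetical data: each input is (feature, 1.0), the 1.0 carrying the intercept.
xs = [(0.5, 1.0), (-1.2, 1.0), (2.0, 1.0), (0.1, 1.0)]
ys = [1, 0, 1, 0]
beta = [0.8, -0.3]

analytic = gradient(beta, xs, ys)
numeric = numeric_gradient(beta, xs, ys)
print(analytic, numeric)
```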
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
====Extension====<br />
<br />
* When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model].<br />
<br />
* Limitations of Logistic Regression:<br />
:1. No assumptions are made about the distribution of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another, because this can cause problems with estimation.<br />
:2. A large number of data points (i.e. a large sample size) is required for logistic regression to have sufficient numbers in both classes. The more features/dimensions the data have, the larger the sample size required.<br />
<br />
==Lecture summary==<br />
{{Cleanup|date=October 18 2010|reason=Can anybody provide a better lecture summary? The one below is to just get it started}}<br />
In this lecture, linear regression was introduced and the posterior probabilities for the two-class problem were defined. Maximum likelihood was used to estimate the parameters of the distribution (i.e. to fit the logistic model to the data).<br />
<br />
== Logistic Regression Cont. - October 14, 2010 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Estimating Parameters <math>\underline{\beta}</math> ===<br />
<br />
'''Criterion''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
'''Newton-Raphson Algorithm:'''<br /><br />
<br />
If we want to find <math>\ x^* </math> such that <math>\ f(x^*)=0</math>,<br />
<br />
we first pick a starting point <math>\ x^{old}</math> and then iterate:<br />
<br /><br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f(x^{old})}{f'(x^{old})} </math> <br /><br />
<math> \ x^{old} \leftarrow x^{new}</math> <br />
<br /><br />
This is repeated until convergence. <br />
<br />
If we want to maximize or minimize <math>\ f(x) </math>, we instead solve <math>\ f'(x)=0 </math>, which gives the update<br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f'(x^{old})}{f''(x^{old})} </math><br />
<br />
<br /><br />
<br />
In vector notation the above can be written as <br /><br />
<br />
<math><br />
X^{new} \leftarrow X^{old} - H^{-1}\Delta<br />
</math><br />
<br /><br />
Here H is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix], or second derivative matrix, and <math>\Delta</math> is the gradient, both evaluated at <math>X^{old}</math>. <br />
<br /><br />
<br />
'''note:''' If the Hessian is not invertible the [http://en.wikipedia.org/wiki/Generalized_inverse generalized inverse] or pseudo inverse can be used<br />
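<br />
The scalar form of the iteration can be sketched in a few lines of Python. This is an illustration only: the function name <code>newton</code>, the example functions, the starting points and the tolerance are all assumptions. The same routine is used twice, once to find a root of <math>f</math> and once, applied to <math>f'</math> and <math>f''</math>, to find a stationary point.<br />
```python
import math


def newton(g, g_prime, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration for solving g(x) = 0, starting from x0."""
    x_old = x0
    for _ in range(max_iter):
        x_new = x_old - g(x_old) / g_prime(x_old)
        if abs(x_new - x_old) < tol:   # stop when the update is negligible
            return x_new
        x_old = x_new
    return x_old


# Root finding: f(x) = x^2 - 2 has the positive root sqrt(2).
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)

# Optimization: f(x) = x - log(x) has f'(x) = 1 - 1/x and f''(x) = 1/x^2,
# so solving f'(x) = 0 with the same routine finds its minimizer x = 1.
x_min = newton(lambda x: 1.0 - 1.0 / x, lambda x: 1.0 / (x * x), x0=0.5)

print(root, math.sqrt(2.0))
print(x_min)
```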
<br /><br />
<br /><br />
<br />
<br />
As shown above, the [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second derivative, or Hessian.<br />
<br />
<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website that includes a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the number of occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math><br />
<br />
and then evaluating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{(d+1)}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i,i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^{T}\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
<math>\ Z </math> is called the adjusted response. Because <math>\ \underline{P} </math>, <math>\ W </math>, and <math>\ Z </math> all depend on <math>\underline{\beta}^{old}</math>, they change at every iteration and the problem has to be solved repeatedly. This algorithm is called [http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares iteratively reweighted least squares] because each step solves a weighted least squares problem.<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-\underline{\beta}^T X)(\underline{y}-\underline{\beta}^TX)^T</math><br />
<br />
for which we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}^{T}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
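<br />
The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the toy data set is hypothetical and deliberately not linearly separable (so that the maximum likelihood estimate is finite), the helper names <code>solve</code> and <code>irls</code> are invented for this sketch, and no safeguards are included for weights that approach zero.<br />
```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def irls(xs, ys, tol=1e-10, max_iter=25):
    """xs: augmented inputs (each ending in 1.0); ys: 0/1 labels (step 2)."""
    m = len(xs[0])
    beta = [0.0] * m                                             # step 1
    for _ in range(max_iter):
        eta = [sum(b * xj for b, xj in zip(beta, x)) for x in xs]
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]            # step 3
        w = [pi * (1.0 - pi) for pi in p]                        # step 4 (diagonal of W)
        z = [e + (y - pi) / wi
             for e, y, pi, wi in zip(eta, ys, p, w)]             # step 5
        # Step 6: solve (X W X^T) beta = X W Z instead of forming an inverse.
        A = [[sum(wi * x[a] * x[b] for wi, x in zip(w, xs)) for b in range(m)]
             for a in range(m)]
        rhs = [sum(wi * x[a] * zi for wi, x, zi in zip(w, xs, z)) for a in range(m)]
        new_beta = solve(A, rhs)
        change = max(abs(nb - b) for nb, b in zip(new_beta, beta))
        beta = new_beta
        if change < tol:                                         # step 7
            break
    return beta


# Hypothetical, non-separable toy data: one feature plus a constant 1.
xs = [(float(v), 1.0) for v in range(6)]
ys = [0, 0, 1, 0, 1, 1]
beta_hat = irls(xs, ys)
print(beta_hat)
```
At convergence the gradient <math>X(\underline{Y}-\underline{P})</math> should be numerically zero, which is a convenient check on the result.<br />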
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow \exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1. A closed form solution exists for least squares.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to lie between 0 and 1 and to sum to 1. No closed form solution exists.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math> (or <math>\,d+1</math> if the intercept is counted). The number of parameters grows linearly with the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically with the dimension.<br />
#LDA estimates its parameters more efficiently by using more information about the data, and samples without class labels can also be used in LDA. <br />
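<br />
To make the difference in growth concrete, the two counts can be tabulated for a few dimensions. The sketch below simply evaluates the formulas from the list above; the function names are hypothetical, and whether the intercept is counted for logistic regression is left as an option because the count depends on that convention.<br />
```python
def lda_param_count(d):
    """Two class means (2d), one shared covariance (d(d+1)/2) and two priors (2)."""
    return 2 * d + d * (d + 1) // 2 + 2


def logistic_param_count(d, count_intercept=False):
    """d coefficients, plus one if the intercept is counted."""
    return d + 1 if count_intercept else d


for d in (2, 10, 100):
    print(d, logistic_param_count(d), lda_param_count(d))
```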
<br />
{{Cleanup|date=October 2010|reason= Could somebody please validate the following points}} <br />
{{Cleanup|date=October 2010|reason= I'm not too sure about the first point either, but it seems reasonable to me. Would be great if someone can confirm this point. Thanks}} <br />
<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust. (For high-dimensional data, logistic regression is more accommodating.)<br />
#In practice, logistic regression and LDA often give similar results.<br />
#Logistic regression is more robust because it does not assume that each independent variable is normally distributed.<br />
<br />
Many other advantages of logistic regression are explained [http://www.statgun.com/tutorials/logistic-regression.html here].<br />
<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data with logistic regression. This function returns B, a <math>\,(d+1)</math><math>\,\times</math><math>\,(k-1)</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
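<br />
:As a small illustration of how this rule is applied, the following Python sketch evaluates the two posteriors and the resulting class for a point in the two-dimensional PCA space. The coefficients are the values of B printed above; the function names and the two query points are assumptions chosen for illustration.<br />
```python
import math

# Intercept, coefficient of X_1 and coefficient of X_2, as printed for B above.
B = (0.1861, -5.5917, -3.0547)


def posteriors(x1, x2):
    """Return (P(Y=1 | x), P(Y=2 | x)) for the fitted two-class model."""
    a = B[0] + B[1] * x1 + B[2] * x2
    p1 = 1.0 / (1.0 + math.exp(-a))   # equals exp(a) / (1 + exp(a))
    return p1, 1.0 - p1


def classify_point(x1, x2):
    """Class 1 if the linear score is non-negative, otherwise class 2."""
    a = B[0] + B[1] * x1 + B[2] * x2
    return 1 if a >= 0 else 2


print(posteriors(0.0, 0.0), classify_point(0.0, 0.0))
print(posteriors(1.0, 1.0), classify_point(1.0, 1.0))
```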
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is the decision boundary found by logistic regression. The line shows how the two classes are split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
===Lecture Summary===<br />
<br />
Traditionally, logistic regression parameters are estimated using maximum likelihood. However, other optimization techniques may be used as well.<br />
<br /><br />
Since there is no closed form solution for the zero of the first derivative of the log-likelihood, the Newton-Raphson algorithm is used. Since the problem is convex (the log-likelihood is concave), any optimum that Newton's method converges to is a global optimum.<br />
<br /><br />
Logistic regression requires fewer parameters than LDA or QDA and is therefore more favorable for high-dimensional data.<br />
<br />
===Supplements===<br />
<br />
A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture.<br />
<br />
== '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' ==<br />
<br />
=== Lecture Summary ===<br />
<br />
In this lecture, the topic of logistic regression was concluded by covering multi-class logistic regression, and a new topic, the perceptron, was introduced. The perceptron is a linear classifier for two-class problems. Its main goal is to classify data into 2 classes by minimizing the distances between the misclassified points and the decision boundary. This will be continued in the following lectures.<br />
<br />
=== Multi-Class Logistic Regression ===<br />
Recall that in two-class logistic regression, the posterior probability of one of the classes (say class 1) is modeled by a function of the form shown in figure 1. <br />
<br />
The posterior probability of the second class (say class 0) is the complement of that of the first class (class 1). <br /><br /><br />
<math>\displaystyle P(Y=0 | X=x) = 1 - P(Y=1 | X=x)</math><br /><br />
<br />
This function is called the sigmoid (logistic) function, which is why this algorithm is called "logistic regression".<br />
[[File:Picture1.png|150px|thumb|right|<math>Fig.1: P(Y=1 | X=x)</math>]]<br />
<br />
<math>\displaystyle \sigma\,\!(a) = \frac {e^a}{1+e^a} = \frac {1}{1+e^{-a}}</math><br /><br /><br />
<br />
In two-class logistic regression, we compare the posterior of one class to the other one using this ratio:<br /><br />
<br />
:<math> \frac{P(Y=1|X=x)}{P(Y=0|X=x)}</math><br /><br />
<br />
If we look at the natural logarithm of this ratio, we find that it is always a linear function in <math>x</math>:<br /><br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\underline{\beta}^T\underline{x} \quad \rightarrow (*)</math> <br /><br /><br />
<br />
What if we have more than two classes?<br /><br />
<br />
Using (*), we can extend the notion of logistic regression for the cases where we have more than two classes.<br /><br />
<br />
Assume we have <math>k</math> classes. Looking at the logarithm of the ratio of posteriors of each class and the k<sup>th</sup> class, we have: <br /><br />
<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_1}^T\underline{x} </math> <br /><br />
:<math>\log\left(\frac{P(Y=2|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_2}^T\underline{x} </math> <br /><br />
::::<math> \vdots</math><br /><br />
:<math>\log\left(\frac{P(Y=k-1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_{k-1}}^T\underline{x} </math> <br /><br />
<br />
<br />
Although in the above posterior ratios, the denominator is chosen to be the posterior of the last class (class k), the choice of denominator is arbitrary in that the posterior estimates are equivariant under this choice - [http://www.springerlink.com/content/t45k620382733r71/ Linear Methods for Classification].<br /><br /><br />
<br />
Each of these functions is linear in <math>x</math>; however, each has a different <math>\underline{\,\beta_{i}}</math>. We have to make sure that the posterior probabilities assigned to the different classes sum to one.<br /><br /><br />
<br />
In general, we can write:<br />
<br /><math>P(Y=c | X=x) = \frac{e^{\underline{\beta_c}^T \underline{x}}}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}},\quad c \in \{1,\dots,k-1\} </math><br /><br />
<br /><math>P(Y=k | X=x) = \frac{1}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}}</math><br /><br />
These posteriors clearly sum to one. <br /><br /><br />
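The two formulas above can be checked numerically. The following is a minimal sketch, assuming class <math>k</math> is the reference class and that <math>x</math> already contains a leading 1 for the intercept; the function name <code>multiclass_posteriors</code> and the example coefficients are hypothetical, chosen only for illustration.<br />

```python
import math

def multiclass_posteriors(betas, x):
    """Posteriors P(Y=c | X=x) for c = 1..k, with class k as the reference.

    betas: list of k-1 coefficient vectors, each the same length as x.
    x:     feature vector (include a leading 1 if an intercept is wanted).
    """
    # Linear scores beta_c^T x for the first k-1 classes.
    scores = [sum(b_j * x_j for b_j, x_j in zip(beta, x)) for beta in betas]
    exps = [math.exp(s) for s in scores]
    denom = 1.0 + sum(exps)
    # Classes 1..k-1 get e^{score}/denom; the reference class k gets 1/denom.
    return [e / denom for e in exps] + [1.0 / denom]

# Hypothetical example with k = 3 classes and x = (1, x1, x2).
betas = [[0.5, 1.0, -1.0], [-0.2, 0.3, 0.8]]
x = [1.0, 2.0, 0.5]
post = multiclass_posteriors(betas, x)
```

Taking <code>math.log(post[c] / post[-1])</code> recovers the linear score of class <math>c+1</math> (Python indices start at 0), which is the log-ratio form given earlier.<br />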
<br />
In the two-class problem, it is fairly simple to find the parameter <math>\beta</math> (the <math>\beta</math> in two-class logistic regression has dimension <math>(d+1)\times1</math>), as mentioned in previous lectures. In the multi-class case the iterative Newton method can still be used, but here <math>\beta</math> is of size <math>(d+1)\times(k-1)</math> and the weight matrix W is dense and non-diagonal. The resulting algorithm is computationally inefficient, though still feasible. A trick is to re-parametrize the logistic regression problem by expanding the input vector <math>x</math> (Question 4 in assignment no. 2).<br />
<br /><br /><br />
<br />
Note that logistic regression does not assume a distribution for the prior, whereas LDA assumes the prior to be Bernoulli. <br /><br /><br />
<br />
===Neural Network Concept===<br />
The concept of an artificial neural network comes from scientists who wanted to simulate the human neural network on computers. They were trying to create computer programs that can learn like people. A neural network is a method in artificial intelligence that is a simplified model of neural processing in the brain, even though the relation between this model and the biological architecture of the brain is not yet clear.<br />
<br />
=== Perceptron ===<br />
<br />
[http://en.wikipedia.org/wiki/Perceptron Perceptron] was invented in 1957 by [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt]. It is the basic building block of feedforward neural networks.<br /><br /><br />
<br />
We know that the least squares fit obtained by regressing a -1/1 response variable <math>\displaystyle Y</math> on the observations <math>\displaystyle x</math> leads to the same coefficients as LDA (recall that LDA minimizes the distance between discriminant function (decision boundary) and the data points). The least squares classifier returns the sign of the linear combination of features as the class label (Figure 2). This concept was called the perceptron in the engineering literature during the 1950s. <br /><br /><br />
<br />
[[File:Perceptron.jpg|371px|thumb|right| Fig.2 Diagram of a linear perceptron ]]<br />
<br />
There is a cost function <math>\displaystyle D</math> that perceptron tries to minimize:<br /><br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math><br /><br />
<br />
where <math>\displaystyle M</math> is the set of misclassified points. <br><br /><br />
<br />
By minimizing D, we are minimizing the sum of distances between the misclassified points and the decision boundary.<br /><br /><br />
<br />
'''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br /><br />
<br />
Consider points <math>\underline{x_1}</math>, <math>\underline{x_2}</math> and a decision boundary defined as <math>\underline{\beta}^T\underline{x}+\beta_0=0</math>, as shown in Figure 3.<br><br />
<br />
[[File:DB.jpg|248px|thumb|right| Fig.3 Distance from the decision boundary ]]<br />
<br />
Both <math>\underline{x_1}</math> and <math>\underline{x_2}</math> lie on the decision boundary, thus:<br /><br />
<math>\underline{\beta}^T\underline{x_1}+\beta_0=0 \rightarrow (1)</math><br /><br />
<math>\underline{\beta}^T\underline{x_2}+\beta_0=0 \rightarrow (2)</math><br><br /><br />
<br />
Consider (2) - (1):<br /><br />
<math>\underline{\beta}^T(\underline{x_2}-\underline{x_1})=0</math><br><br /><br />
<br />
We see that <math>\displaystyle \underline{\beta}</math> is orthogonal to <math>\underline{x_2}-\underline{x_1}</math>, which lies along the decision boundary; hence <math>\displaystyle \underline{\beta}</math> is orthogonal to the decision boundary. <br><br />
<br />
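Then the signed distance of a point <math>\underline{x_0}</math> from the decision boundary is, up to the scale factor <math>1/\|\underline{\beta}\|</math> (which equals 1 when <math>\underline{\beta}</math> has unit length): <br />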
<br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})</math><br><br /><br />
<br />
From (2): <br /><br />
<br />
<math>\underline{\beta}^T\underline{x_2}= -\beta_0</math>. <br /><br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})=\underline{\beta}^T\underline{x_0}-\underline{\beta}^T\underline{x_2}=\underline{\beta}^T\underline{x_0}+\beta_0</math><br /><br />
<br />
Therefore, the signed distance from any point <math>\underline{x_{i}}</math> to the discriminant hyperplane is given (for <math>\underline{\beta}</math> of unit length) by <math>\underline{\beta}^T\underline{x_{i}}+\beta_0</math>.<br /><br />
<br />
However, this quantity is not always positive. Consider <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>. If <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive, since <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are either both positive or both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other one is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> will be negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> makes this cost function non-negative (since the sum runs only over misclassified points). <br /><br /><br />
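As an illustration of the cost function, the following minimal sketch evaluates <math>D(\underline{\beta},\beta_0)</math> on a small, made-up dataset. The function name <code>perceptron_cost</code> and the data are hypothetical, and points lying exactly on the boundary are treated here as correctly classified, which is an assumption rather than something the lecture specifies.<br />

```python
def perceptron_cost(beta, beta0, X, y):
    """Return D(beta, beta0) and the index set M of misclassified points."""
    cost = 0.0
    misclassified = []
    for i, (x_i, y_i) in enumerate(zip(X, y)):
        # y_i * (beta^T x_i + beta0) is negative exactly when x_i is misclassified.
        margin = y_i * (sum(b * v for b, v in zip(beta, x_i)) + beta0)
        if margin < 0:
            misclassified.append(i)
            cost -= margin  # the leading "-" in D makes each term positive
    return cost, misclassified

# Hypothetical data: labels are +1 / -1.
X = [[2.0, 1.0], [-1.0, -1.0], [1.0, -2.0], [-2.0, 0.5]]
y = [1, -1, 1, -1]
cost, M = perceptron_cost([0.0, 1.0], 0.0, X, y)
```

With <math>\underline{\beta}=(0,1)^T</math> and <math>\beta_0=0</math>, the last two points fall on the wrong side of the boundary, so only they contribute to the cost.<br />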
<br />
==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 ==<br />
===Lecture Summary===<br />
In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on gradient descent. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network (FFNN). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation. <br />
<br />
To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.<br />
<br />
===Perceptron===<br />
The last lecture introduced the Perceptron and showed how it can suggest a solution for the 2-class classification problem. We saw that the solution requires minimization of a cost function, which is basically a summation of the distances of the misclassified data points to the separating hyperplane. This cost function is<br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x}_i+\beta_0),</math><br />
<br />
in which <math>\,M</math> is the set of misclassified points. Thus, the objective is to find <math>\arg\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>.<br />
<br />
====Perceptron Learning Algorithm====<br />
To minimize <math>D(\underline{\beta},\beta_0)</math>, an algorithm that uses gradient descent has been suggested. Gradient descent, also known as steepest descent, is a numerical optimization technique that starts from an initial value for <math>(\underline{\beta},\beta_0)</math> and iteratively approaches an optimal solution. Each iteration updates <math>(\underline{\beta},\beta_0)</math> by subtracting from it a multiple of the gradient of <math>D(\underline{\beta},\beta_0)</math>. Mathematically, this gradient is<br />
<br />
<math>\nabla D(\underline{\beta},\beta_0)<br />
= \left( \begin{array}{c}\cfrac{\partial D}{\partial \underline{\beta}} \\ \\ <br />
\cfrac{\partial D}{\partial \beta_0} \end{array} \right)<br />
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}\underline{x}_i \\ <br />
-\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math><br />
<br />
However, the perceptron learning algorithm does not sum the contributions from all observations to calculate the gradient for each step. Instead, each step uses the gradient contribution from only a single misclassified observation, and each successive step uses a different observation. This slight modification is called stochastic gradient descent. That is, instead of subtracting some factor of <math>\nabla D(\underline{\beta},\beta_0)</math> at each step, we subtract a factor of the single-observation gradient<br />
<br />
<math>-\left( \begin{array}{c} y_{i}\underline{x}_i \\ <br />
y_{i} \end{array} \right),</math><br />
<br />
which amounts to adding a positive multiple of <math>(y_i\underline{x}_i,\; y_i)</math> to the current estimate.<br />
<br />
As a result, the pseudo code for the Perceptron Learning Algorithm is as follows:<br />
<br />
:1) Choose a random initial value for <math>(\underline{\beta},\beta_0)</math>.<br />
<br />
:2) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^0\\<br />
\beta_0^0<br />
\end{pmatrix}</math><br />
<br />
:3) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{new}}\\<br />
\beta_0^{\mathrm{new}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix}<br />
y_i \underline{x_i}\\<br />
y_i<br />
\end{pmatrix}</math> for some <math>\,i \in M</math>.<br />
<br />
:4) If the termination criterion has not been met, go back to step 3 and use a different observation datapoint (i.e. a different <math>\,i</math>).<br />
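The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names <code>margin</code> and <code>train_perceptron</code>, the iteration cap <code>max_iter</code>, the fixed random seed and the toy data are all assumptions made for the example, and points lying exactly on the boundary are counted as misclassified.<br />

```python
import random

def margin(beta, beta0, x_i, y_i):
    """y_i * (beta^T x_i + beta0); positive when x_i is classified correctly."""
    return y_i * (sum(b * v for b, v in zip(beta, x_i)) + beta0)

def train_perceptron(X, y, rho=1.0, max_iter=1000, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: random initial value for (beta, beta0).
    beta = [rng.uniform(-1.0, 1.0) for _ in X[0]]
    beta0 = rng.uniform(-1.0, 1.0)
    for _ in range(max_iter):
        # Current set M of misclassified points.
        M = [i for i in range(len(X)) if margin(beta, beta0, X[i], y[i]) <= 0]
        if not M:                      # termination: nothing is misclassified
            return beta, beta0, True
        i = rng.choice(M)              # step 3: update with one point from M
        beta = [b + rho * y[i] * v for b, v in zip(beta, X[i])]
        beta0 += rho * y[i]
    return beta, beta0, False          # iteration cap reached without separating

# Hypothetical, linearly separable data with labels +1 / -1.
X = [[2.0, 1.0], [-1.0, -1.0], [1.0, -2.0], [-2.0, 0.5]]
y = [1, -1, 1, -1]
beta, beta0, converged = train_perceptron(X, y)
```

The third return value reports whether the loop stopped because <math>\,M</math> became empty or because the iteration cap was hit, which matters when the classes are not linearly separable.<br />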
<br />
The learning rate <math>\,\rho</math> controls the step size of convergence toward <math>\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>. A larger value for <math>\,\rho</math> causes the steps to be larger. If <math>\,\rho</math> is set to be too large, however, then the minimum could be missed (over-stepped).<br />
In practice, <math>\rho</math> can be adaptive rather than fixed: in the first steps <math>\rho</math> can be larger than in the last steps. At the beginning, a larger <math>\rho</math> helps to find an approximate answer sooner, and a smaller <math>\rho</math> in the last steps helps to tune the final answer more accurately. <br />
<br />
<br />
As mentioned earlier, the learning algorithm uses just one of the data points at each iteration; this is the common practice when dealing with online applications. In an online application, datapoints are accessed one-at-a-time because training data is not available in batch form. The learning algorithm does not require the derivative of the cost function with respect to the previously seen points; instead, we just have to take into consideration the effect of each new point.<br />
<br />
One way that the algorithm could terminate is if there are no more misclassified points (i.e. if the set <math>\,M</math> is empty). As long as there are points in <math>\,M</math>, the algorithm continues until some other termination criterion is reached. The termination criterion for an optimization algorithm is usually convergence, but for numerical methods this is not well-defined. In theory, convergence is realized when the gradient of the cost function is zero; in numerical methods an answer close to zero within some margin of error is taken instead.<br />
<br />
If the data are linearly separable, the algorithm is theoretically guaranteed to converge in a finite number of iterations. This number of iterations depends on the <br />
<br />
* learning rate <math>\,\rho</math><br />
<br />
* initial value <math>(\underline{\beta},\beta_0)</math><br />
<br />
* difficulty of the problem. The problem is more difficult if the gap between the classes of data is very small.<br />
<br />
Note that we consider the offset term <math>\beta_0</math> separately from the <math>\underline{\beta}</math> to distinguish this formulation from those in which the direction of the hyperplane (<math>\underline{\beta}</math>) has been considered.<br />
<br />
A major concern about gradient descent is that it may get trapped in local optimal solutions.<br />
<br />
====Some notes on the Perceptron Learning Algorithm====<br />
<br />
* If the training data points are available in batch form, it is better to take advantage of a closed optimization technique like least-squares or maximum-likelihood estimation for linear classifiers. (These closed solutions had been around for many years before the invention of the Perceptron.)<br />
<br />
* Just like the linear classifier, a Perceptron can discriminate between only two classes at a time, and one can generalize its performance for multi-class problems by using one of the <math>k-1</math>, <math>k</math>, or <math>k(k-1)/2</math>-hyperplane methods.<br />
<br />
* If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane, which makes the error of training data zero. The convergence is guaranteed if the learning rate is set adequately.<br />
<br />
* If the two classes are not linearly separable, the algorithm will never converge. So, one may think of a termination criterion in these cases. (e.g. a maximum number of iterations in which convergence is expected, or the rate of changes in both a cost function and its derivative).<br />
<br />
* In the case of linearly separable classes, the final solution and number of iterations will be dependent on the initial conditions, learning rate, and the gap between the two classes. In general, a smaller gap between classes requires a greater number of iterations for the algorithm to converge.<br />
<br />
* Learning rate --or updating step-- has a direct impact on both number of iterations and the accuracy of the solution for the optimization problem. Smaller quantities for this factor make convergence slower, even though we will end up with a more accurate solution. In the opposite way, larger values for learning rate make the process faster, even though we may lose some precision. So, one may make a balance for this trade-off in order to get fast enough to an accurate enough solution. (exploration vs. exploitation)<br />
<br />
In the upcoming lectures, we introduce Support Vector Machines (SVM), which use an iterative optimization scheme similar to that of the Perceptron but have a different definition for the cost function.<br />
<br />
===Universal Function Approximator===<br />
The universal function approximator is a mathematical formulation for a group of estimation techniques. The usual formulation for it is<br />
<br />
<math>\hat{Y}(x)=\sum\limits_{i=1}^{n}\alpha_i\sigma(\omega_i^Tx+b_i),</math><br />
<br />
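where <math>\hat{Y}(x)</math> is an estimate of a function <math>\,Y(x)</math>. According to the universal approximation theorem, for a continuous <math>\,Y</math> on a closed and bounded domain and any <math>\,\epsilon>0</math>, there exist <math>\,n</math> and parameters <math>\,\alpha_i</math>, <math>\,\omega_i</math>, <math>\,b_i</math> such that<br />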
<br />
<math>|\hat{Y}(x) - Y(x)|<\epsilon,</math><br />
<br />
which means that <math>\hat{Y}(x)</math> can get as close to <math>\,Y(x)</math> as necessary.<br />
<br />
This formulation assumes that the output, <math>\,Y(x)</math>, is a linear combination of a set of functions like <math>\,\sigma(.)</math> where <math>\,\sigma(.)</math> is a nonlinear function of the inputs or <math>\,x_i</math>s.<br />
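To make the formulation concrete, the following minimal sketch evaluates <math>\hat{Y}(x)</math> for given parameters, assuming <math>\,\sigma</math> is the logistic function. The function name <code>approximator</code> and the parameter values are hypothetical; they are picked so that two steep sigmoids combine into a "bump" that is close to 1 on roughly <math>(0.25, 0.75)</math> and close to 0 elsewhere.<br />

```python
import math

def sigma(a):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-a))

def approximator(x, alphas, omegas, biases):
    """Y_hat(x) = sum_i alpha_i * sigma(omega_i^T x + b_i)."""
    return sum(
        alpha * sigma(sum(w_j * x_j for w_j, x_j in zip(omega, x)) + b)
        for alpha, omega, b in zip(alphas, omegas, biases)
    )

# Hypothetical parameters for n = 2 terms and a one-dimensional input.
alphas = [1.0, -1.0]
omegas = [[20.0], [20.0]]
biases = [-5.0, -15.0]          # the sigmoids switch on at x = 0.25 and x = 0.75

inside = approximator([0.5], alphas, omegas, biases)
outside = approximator([1.0], alphas, omegas, biases)
```

Adding more such terms with different heights and positions gives a piecewise-flat approximation of a one-dimensional function, which is one informal way to see why the sum can fit a target closely.<br />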
<br />
====Generalization Factors====<br />
Even though this formulation represents a universal function approximator, which means that it can be fitted to a set of data as closely as demanded, the closeness of fit must be carefully decided upon. In many cases, the purpose of the model is to target unseen data. However, the fit to this unseen data is impossible to determine before it arrives.<br />
<br />
To overcome this dilemma, a common practice is to divide the available data points into two sets: training data and validation data. We use the training data to estimate the fixed parameters for the model, and then use the validation data to find values for the construction-dependent parameters. What these construction-dependent parameters are depends on the model. In the case of a polynomial, the construction-dependent parameter would be its highest degree, and for a neural network, the construction-dependent parameters could be the number of hidden layers and the number of neurons in each layer.<br />
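The polynomial case can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (a noisy quadratic), the 20/10 split is arbitrary, and the helper names <code>polyfit</code> and <code>mse</code> are hypothetical (the fit uses the normal equations solved by Gaussian elimination, which is adequate for low degrees on <math>[0,1]</math>).<br />

```python
import random

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first)."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y for the Vandermonde matrix A.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    aty = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, n):
            f = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    # Back substitution.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aty[r] - sum(ata[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = s / ata[r][r]
    return coef

def mse(coef, xs, ys):
    """Mean squared error of the polynomial on (xs, ys)."""
    return sum((sum(c * x ** k for k, c in enumerate(coef)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: a quadratic plus small Gaussian noise.
rng = random.Random(0)
xs = [i / 29 for i in range(30)]
ys = [1.0 + 2.0 * x - 3.0 * x * x + rng.gauss(0.0, 0.05) for x in xs]

# Split the available data into training and validation sets.
idx = list(range(30))
rng.shuffle(idx)
train, valid = idx[:20], idx[20:]
x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
x_va, y_va = [xs[i] for i in valid], [ys[i] for i in valid]

# Coefficients are fitted on training data; the degree is chosen on validation data.
val_err = {deg: mse(polyfit(x_tr, y_tr, deg), x_va, y_va) for deg in range(6)}
best_degree = min(val_err, key=val_err.get)
```

Here the coefficients play the role of the fixed parameters and the degree is the construction-dependent parameter; the same pattern would apply to choosing the number of hidden units in a neural network.<br />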
<br />
These matters of model generalization vs. complexity will be discussed in more detail in the lectures to follow.<br />
<br />
===Feed-Forward Neural Network===<br />
The Neural Network (NN) is one application of the universal function approximator. It can be thought of as a system of Perceptrons linked together as units of a network. One particular NN useful for classification is the Feed-Forward Neural Network (FFNN), which consists of multiple "hidden layers" of Perceptron units. Our discussion here is based around the FFNN, which has the topology shown in Figure 1. The units of the first hidden layer receive input from the original features. Between the hidden layers, connections from each unit are always directed to units in the next adjacent layer. In the output layer, which receives input only from the last hidden layer, each unit produces a target measurement for a distinct class (i.e. <math>\,K</math> classes require <math>\,K</math> units). In Figure 1, the units in a single layer are distributed vertically, and the inputs and outputs of the network are shown as the far left and right layers respectively.<br />
<br />
[[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]]<br />
<br />
====Mathematical Model of the FFNN with One Hidden Layer====<br />
The FFNN with one hidden layer for a <math>\,K</math>-class problem is defined as follows. Let <math>\,d</math> be the number of input features, <math>\,p</math> be the number of units in the hidden layer, and <math>\,K</math> be the number of classes (i.e. the number of units in the output layer).<br />
<br />
Each neural unit calculates its derived feature (i.e. output) using a linear combination of its inputs. Suppose <math>\,\underline{x}</math> is the <math>\,d</math>-dimensional vector of input features. Then, each neural unit uses a <math>\,d</math>-dimensional vector of weights to combine these input features: for the <math>\,i</math>th neural unit, let <math>\underline{u}_i</math> be this vector of weights. The linear combination calculated by the <math>\,i</math>th unit is then given by<br />
<br />
<math>a_i = \underline{u}_i^T\underline{x}</math><br />
<br />
However, we want the derived feature to lie between 0 and 1, so we apply an ''activating function'' <math>\,\sigma(a)</math>. The derived feature for the <math>\,i</math>th unit is then given by<br />
<br />
<math>\,z_i = \sigma(a_i)</math> where <math>\,\sigma</math> is typically the logistic function<br />
<br />
<math>\sigma(a) = \cfrac{1}{1+e^{-a}}</math><br />
<br />
Now, we place each of the derived features <math>\,z_i</math> from the hidden layer into a <math>\,p</math>-dimensional vector:<br />
<br />
<math>\underline{z} = \left[ \begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_p \end{array}\right]</math><br />
<br />
Like in the hidden layer, each unit in the output layer calculates its derived feature using a linear combination of its inputs. Each neural unit uses a <math>\,p</math>-dimensional vector of weights to combine the input features derived from the hidden layer. Let <math>\,\underline{w}_k</math> be this vector of weights used in the <math>\,k</math>th unit. The linear combination calculated by the <math>\,k</math>th unit is then given by<br />
<br />
<math>\hat{y}_k = \underline{w}_k^T\underline{z}</math><br />
<br />
<math>\,\hat{y}_k</math> is thus the target measurement for the <math>\,k</math>th class. Note that an activation function <math>\,\sigma</math> is not used here.<br />
<br />
Notice that in each of the hidden units, two operations take place:<br />
<br />
* a linear combination of the neuron's inputs is calculated using corresponding weights<br />
<br />
* a nonlinear operation on the linear combination is performed. <br />
<br />
These two calculations are shown in Figure 2. <br />
<br />
The nonlinear function <math>\,\sigma(.)</math> is called the activation function. Activation functions, like the logistic function shown earlier, are usually continuous and have a limited range. Another common activation function used in neural networks is <math>\,tanh(x)</math> (Figure 3).<br />
<br />
[[File:neuron2.png|300px|thumb|right|Fig.2 A general construction for a single neuron]]<br />
[[File:actfcn.png|300px|thumb|right|Fig.3 <math>tanh</math> as activation function]]<br />
<br />
The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression, and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, a threshold stage is necessary.<br />
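The one-hidden-layer model described in this section can be sketched as a forward pass. This is a minimal illustration: the names <code>forward</code> and <code>classify</code> and the example weights are hypothetical, and the threshold stage is assumed here to be "pick the class with the largest output", which is one common choice rather than something the lecture prescribes.<br />

```python
import math

def sigma(a):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, U, W):
    """Forward pass of an FFNN with one hidden layer.

    x: d input features
    U: p weight vectors u_i, each of length d (hidden layer)
    W: K weight vectors w_k, each of length p (output layer)
    """
    a = [sum(u_l * x_l for u_l, x_l in zip(u_i, x)) for u_i in U]       # a_i = u_i^T x
    z = [sigma(a_i) for a_i in a]                                       # z_i = sigma(a_i)
    y_hat = [sum(w_i * z_i for w_i, z_i in zip(w_k, z)) for w_k in W]   # y_hat_k = w_k^T z
    return a, z, y_hat

def classify(y_hat):
    """Assumed threshold stage: return the (1-based) class with the largest output."""
    return max(range(len(y_hat)), key=lambda k: y_hat[k]) + 1

# Hypothetical example with d = 2 features, p = 3 hidden units, K = 2 classes.
x = [1.0, -1.0]
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a, z, y_hat = forward(x, U, W)
label = classify(y_hat)
```

For a regression task the vector <code>y_hat</code> would be used directly and the <code>classify</code> step would be dropped.<br />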
<br />
====Mathematical Model of the FFNN with Multiple Hidden Layers====<br />
In the FFNN model with a single hidden layer, the derived features were represented as elements of the vector <math>\underline{z}</math>, and the original features were represented as elements of the vector <math>\underline{x}</math>. In the FFNN model with more than one hidden layer, <math>\underline{z}</math> is processed by the second hidden layer in the same way that <math>\underline{x}</math> was processed by the first hidden layer. Perceptrons in the second layer each use their own combination of weights to calculate a new set of derived features. These new derived features are processed by the third hidden layer in a similar way, and the cycle repeats for each additional hidden layer. This progression of processing is depicted in Figure 4.<br />
<br />
====Back-Propagation Learning Algorithm====<br />
<br />
[[File:bpl.png|300px|thumb|right|Fig.4 Labels for weights and derived features in the FFNN.]]<br />
<br />
Every linear-combination calculation in the FFNN involves weights that need to be set, and these weights are set using training data and an algorithm called Back-Propagation. This algorithm is similar to the gradient-descent algorithm introduced in the discussion of the Perceptron. The primary difference is that the gradient used in Back-Propagation is calculated in a more complicated way.<br />
<br />
First of all, we want to minimize the error between the estimated and true target measurements for the training data. That is, if <math>\,U</math> is the set of all weights in the FFNN, then we want to determine<br />
<br />
<math>\arg\min_U \left|y - \hat{y}\right|^2</math><br />
<br />
Now, suppose the hidden layers of the FFNN are labelled as in Figure 4. Then, we want to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the hidden layers of the FFNN. For weights <math>\,u_{jl}</math> this means we will need to find<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}}<br />
= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot<br />
\cfrac{\partial a_j}{\partial u_{jl}} = \delta_{j}z_l<br />
</math><br />
<br />
However, the closed-form solution for <math>\,\delta_{j}</math> is unknown, so we develop a recursive definition (<math>\,\delta_{j}</math> in terms of <math>\,\delta_{i}</math>):<br />
<br />
<math><br />
\delta_j = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j} <br />
= \sum_{i=1}^p \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot<br />
\cfrac{\partial a_i}{\partial a_j} <br />
= \sum_{i=1}^p \delta_i\cdot u_{ij} \cdot \sigma'(a_j)<br />
= \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}<br />
</math><br />
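The factor <math>\partial a_i / \partial a_j</math> in the recursion can be made explicit. As a sketch, assuming the labelling used above (unit <math>\,j</math> feeds unit <math>\,i</math> in the next layer through the weight <math>\,u_{ij}</math>, and <math>\,z_j = \sigma(a_j)</math>):<br />

<math>a_i = \sum_{j} u_{ij} z_j = \sum_{j} u_{ij}\,\sigma(a_j) \quad\Rightarrow\quad \cfrac{\partial a_i}{\partial a_j} = u_{ij}\,\sigma'(a_j)</math><br />

For the logistic activation, <math>\,\sigma'(a) = \sigma(a)(1-\sigma(a))</math>; for <math>\,tanh</math>, <math>\,\sigma'(a) = 1 - tanh^2(a)</math>.<br />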
<br />
We also need to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the ''output layer'' <math>\,k</math> of the FFNN (this layer is not shown in Figure 4, but it would be the next layer to the right of the rightmost layer shown). For weights <math>\,u_{ki}</math> this means<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}}<br />
= \cfrac{\partial \left|y - \sum_m u_{km}z_m\right|^2}{\partial u_{ki}}<br />
= -2(y - \sum_m u_{km}z_m)z_i<br />
= -2(y - \hat{y})z_i<br />
</math><br />
<br />
With similarity to our computation of <math>\,\delta_j</math>, we define<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}</math><br />
<br />
However, <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}}<br />
= -2(y - \hat{y})</math><br />
<br />
Now that we have <math>\,\delta_k</math> and a recursive definition for <math>\,\delta_j</math>, it is clear that the derivatives with respect to all weights can be computed by starting from the output layer and working back through the hidden layers toward the input layer.<br />
<br />
Based on the above derivation, our algorithm for determining weights in the FFNN is as follows<br />
<br />
:1) Choose random initial weights.<br />
<br />
:2) Apply a new datapoint <math>\underline{x}</math> to the FFNN as the input layer, and calculate the values for all units.<br />
<br />
:3) Compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math>.<br />
<br />
:4) Back-propagate layer-by-layer by computing <math>\delta_j = \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}</math> for all units.<br />
<br />
:5) Compute <math>\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} = \delta_{j}z_l</math> for all weights <math>\,u_{jl}</math>.<br />
<br />
:6) Update <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}}<br />
- \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} </math> for all weights <math>\,u_{jl}</math> where <math>\,\rho</math> is the learning rate.<br />
<br />
:7) If the termination criterion has not been met, go back to step 2 and apply another datapoint (one full pass through all the training datapoints is called an "epoch").<br />
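Steps 2 to 6 can be sketched for the one-hidden-layer network with a logistic activation. This is a minimal illustration under assumptions: the function names, the example weights and the learning rate are hypothetical, the output-layer weights are called <code>W</code> as in the one-hidden-layer model, and the loss is <math>\left|y - \hat{y}\right|^2</math> summed over the output units.<br />

```python
import math

def sigma(a):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, U, W):
    """Return hidden outputs z and network outputs y_hat."""
    a = [sum(u * v for u, v in zip(u_j, x)) for u_j in U]
    z = [sigma(a_j) for a_j in a]
    y_hat = [sum(w * v for w, v in zip(w_k, z)) for w_k in W]
    return z, y_hat

def loss(x, y, U, W):
    """Squared error |y - y_hat|^2 for one datapoint."""
    _, y_hat = forward(x, U, W)
    return sum((y_k - yh_k) ** 2 for y_k, yh_k in zip(y, y_hat))

def backprop_step(x, y, U, W, rho):
    """One stochastic back-propagation update; returns new weights and the gradients."""
    z, y_hat = forward(x, U, W)                                        # step 2
    delta_out = [-2.0 * (y_k - yh_k) for y_k, yh_k in zip(y, y_hat)]   # step 3
    # Step 4: delta_j = sigma'(a_j) * sum_k delta_k * w_kj, with sigma'(a_j) = z_j (1 - z_j).
    delta_hid = [z_j * (1.0 - z_j) * sum(d_k * w_k[j] for d_k, w_k in zip(delta_out, W))
                 for j, z_j in enumerate(z)]
    # Step 5: gradient = delta of the receiving unit times the input on that connection.
    grad_W = [[d_k * z_j for z_j in z] for d_k in delta_out]
    grad_U = [[d_j * x_l for x_l in x] for d_j in delta_hid]
    # Step 6: gradient-descent update with learning rate rho.
    W_new = [[w - rho * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, grad_W)]
    U_new = [[u - rho * g for u, g in zip(u_row, g_row)] for u_row, g_row in zip(U, grad_U)]
    return U_new, W_new, grad_U, grad_W

# Hypothetical datapoint and initial weights (d = 2, p = 2, K = 2).
x, y = [0.5, -1.0], [1.0, 0.0]
U = [[0.1, -0.2], [0.4, 0.3]]
W = [[0.2, -0.1], [0.05, 0.3]]
U_new, W_new, grad_U, grad_W = backprop_step(x, y, U, W, rho=0.05)
```

A useful sanity check for any back-propagation code is to compare one entry of the computed gradient with a finite-difference estimate obtained by perturbing that weight slightly and re-evaluating <code>loss</code>.<br />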
<br />
====Alternative Description of the Back-Propagation Algorithm====<br />
Label the inputs and outputs of the <math>\,i</math>th hidden layer <math>\underline{x}_i</math> and <math>\underline{y}_i</math> respectively, and let <math>\,\sigma(.)</math> be the activation function for all neurons. We now have<br />
<br />
<math>\begin{align}<br />
\begin{cases}<br />
\underline{y}_1=\sigma(W_1.\underline{x}_1),\\<br />
\underline{y}_2=\sigma(W_2.\underline{x}_2),\\<br />
\underline{y}_3=\sigma(W_3.\underline{x}_3),<br />
\end{cases}<br />
\end{align}</math><br />
<br />
where <math>\,W_i</math> is the matrix of connection weights between layers <math>\,i</math> and <math>\,i+1</math>, and has <math>\,n_i</math> columns and <math>\,n_{i+1}</math> rows, where <math>\,n_i</math> is the number of neurons of the <math>\,i^{th}</math> layer.<br />
<br />
Considering these matrix equations, one can write a closed form for the derivative of the error with respect to the weights of the network. For a neural network with two hidden layers, the equations are as follows.<br />
<br />
<math>\begin{align}<br />
\frac{\partial E}{\partial W_3}=&diag(e).\sigma'(W_3.\underline{x}_3).(\underline{x}_3)^T,\\<br />
\frac{\partial E}{\partial W_2}=&\sigma'(W_2.\underline{x}_2).(\underline{x}_2)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3\}\},\\<br />
\frac{\partial E}{\partial W_1}=&\sigma'(W_1.\underline{x}_1).(\underline{x}_1)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3.diag(\sigma'(W_2.\underline{x}_2)).W_2\}\},<br />
\end{align}</math><br />
<br />
where <math>\,\sigma'(.)</math> is the derivative of the activation function <math>\,\sigma(.)</math>.<br />
<br />
Using this closed-form derivative, it is possible to code the procedure for any number of layers and neurons. Here is Matlab code for the backpropagation algorithm. (<math>\,tanh</math> is utilized as the activation function.)<br />
<br />
<br />
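 % Assumed to be defined beforehand (not shown in the original notes):<br />
 %   ep = number of epochs, Q = number of training points, f = number of features,<br />
 %   dd = number of rows of data (features followed by targets), l = number of weight layers,<br />
 %   n = number of units per layer, w = weight array, ro = learning rate,<br />
 %   E = zero-initialized error per epoch, i = 0, and shuffle = a helper (not a built-in)<br />
 %   that randomly permutes data along dimension 2.<br />
 while i < ep<br />
     i = i + 1;<br />
     data = shuffle(data,2);                      % new random order of the datapoints for this epoch<br />
     for j = 1:Q                                  % process one datapoint at a time<br />
         io = zeros(max(n)+1,length(n));          % outputs of every layer (row 1 is the bias unit)<br />
         gp = io;                                 % activation derivatives of every layer<br />
         io(1:n(1)+1,1) = [1;data(1:f,j)];        % input layer: bias followed by the features<br />
         for k = 1:l                              % forward pass through the layers<br />
             io(2:n(k+1)+1,k+1) = w(2:n(k+1)+1,1:n(k)+1,k)*io(1:n(k)+1,k);   % linear combinations<br />
             gp(1:n(k+1)+1,k) = [0;1./(cosh(io(2:n(k+1)+1,k+1))).^2];        % tanh'(a) = 1/cosh(a)^2<br />
             io(1:n(k+1)+1,k+1) = [1;tanh(io(2:n(k+1)+1,k+1))];              % bias followed by tanh(a)<br />
             wg(1:n(k+1)+1,1:n(k)+1,k) = diag(gp(1:n(k+1)+1,k))*w(1:n(k+1)+1,1:n(k)+1,k);   % weights scaled by derivatives<br />
         end<br />
         e = [0;io(2:n(l+1)+1,l+1) - data(f+1:dd,j)];                        % output error (0 for the bias row)<br />
         wg(1:n(l+1)+1,1:n(l)+1,l) = diag(e)*wg(1:n(l+1)+1,1:n(l)+1,l);<br />
         gp(1:n(l+1)+1,l) = diag(e)*gp(1:n(l+1)+1,l);<br />
         d = eye(n(l+1)+1);                       % accumulates the back-propagated factors<br />
         E(i) = E(i) + 0.5*norm(e)^2;             % accumulate the squared error for this epoch<br />
         for k = l:-1:1                           % backward pass: update weights layer by layer<br />
             w(1:n(k+1)+1,1:n(k)+1,k) = w(1:n(k+1)+1,1:n(k)+1,k) - ro*diag(sum(d,1))*gp(1:n(k+1)+1,k)*(io(1:n(k)+1,k)');<br />
             d = d*wg(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
     end<br />
 end<br />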
<br />
====Some notes on the neural network and its learning algorithm====<br />
<br />
* The activation functions are usually approximately linear around the origin. If this is the case, choosing random weights between <math>\,-0.5</math> and <math>\,0.5</math> and normalizing the data may speed up the algorithm in the very first steps of the procedure, as the linear combination of the inputs and weights falls within the linear region of the activation function.<br />
<br />
* Learning of the neural network using backpropagation algorithm takes place in epochs. An Epoch is a single pass through the entire training set.<br />
<br />
* It is a common practice to randomly change the permutation of the training data in each one of the epochs, to make the learning independent of the data permutation.<br />
<br />
* Given a set of data for training a neural network, one should set aside a portion of it as the validation dataset, which is used to choose a suitable number of layers and number of neurons in each layer. The best construction may be the one that leads to the least error on the validation dataset. Validation data must not be used for training the network.<br />
<br />
* We can also use the validation-training scheme to estimate how many epochs are enough for training the network.<br />
<br />
* It is also common to use other optimization algorithms, such as steepest descent and conjugate gradient, in a batch form.<br />
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem, where it is difficult to estimate the errors. So in practice configuring a Neural Network with Back-propagation faces some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. The Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
<br />
The approach to training the deep network is to first assume the network has only two layers and train these two layers. After that we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize the energy function, which is inspired by the theory in atomic physics concerning the most stable condition; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.<br />
<br />
==== Difficulties of training deep architecture <ref>{{Cite journal | title = Exploring Strategies for Training Deep Neural Networks | url = http://jmlr.csail.mit.edu/papers/volume10/larochelle09a/larochelle09a.pdf | year = 2009 | journal = Journal of Machine Learning Research | page = 1-40 | volume = 10 | last1 = Larochelle | first1 = H. | last2 = Bengio | first2 = Y. | last3 = Louradour | first3 = J. | last4 = Lamblin | first4 = P. }}</ref> ====<br />
<br />
Given a particular task, a natural way to train a deep network is to frame it as an optimization<br />
problem by specifying a supervised cost function on the output layer with respect to the desired<br />
target and use a gradient-based optimization algorithm in order to adjust the weights and biases<br />
of the network so that its output has low cost on samples in the training set. Unfortunately, deep<br />
networks trained in that manner have generally been found to perform worse than neural networks<br />
with one or two hidden layers.<br />
<br />
We discuss two hypotheses that may explain this difficulty. The first one is that gradient descent<br />
can easily get stuck in poor local minima (Auer et al., 1996) or plateaus of the non-convex training<br />
criterion. The number and quality of these local minima and plateaus (Fukumizu and Amari, 2000)<br />
clearly also influence the chances for random initialization to be in the basin of attraction (via<br />
gradient descent) of a poor solution. It may be that with more layers, the number or the width<br />
of such poor basins increases. To reduce the difficulty, it has been suggested to train a neural<br />
network in a constructive manner in order to divide the hard optimization problem into several<br />
greedy but simpler ones, either by adding one neuron (e.g., see Fahlman and Lebiere, 1990) or one<br />
layer (e.g., see Lengellé and Denoeux, 1996) at a time. These two approaches have demonstrated to<br />
be very effective for learning particularly complex functions, such as a very non-linear classification<br />
problem in 2 dimensions. However, these are exceptionally hard problems, and for learning tasks<br />
usually found in practice, this approach commonly overfits.<br />
<br />
This observation leads to a second hypothesis. For high capacity and highly flexible deep networks,<br />
there actually exists many basins of attraction in its parameter space (i.e., yielding different<br />
solutions with gradient descent) that can give low training error but that can have very different generalization<br />
errors. So even when gradient descent is able to find a (possibly local) good minimum<br />
in terms of training error, there are no guarantees that the associated parameter configuration will<br />
provide good generalization. Of course, model selection (e.g., by cross-validation) will partly correct<br />
this issue, but if the number of good generalization configurations is very small in comparison<br />
to good training configurations, as seems to be the case in practice, then it is likely that the training<br />
procedure will not find any of them. But, as we will see in this paper, it appears that the type of<br />
unsupervised initialization discussed here can help to select basins of attraction (for the supervised<br />
fine-tuning optimization phase) from which learning good solutions is easier both from the point of<br />
view of the training set and of a test set.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When neural networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". We now know, however, that they are just logistic regression layers stacked on top of each other and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. Because of claims of this kind, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since the objective function is not convex, we still face the problem of local minima, although people have devised various techniques to mitigate it.<br />
<br />
In sum, neural networks lack a strong learning theory to back up their "success", and so it is hard to apply and adjust them wisely. Although they are not an active research area in machine learning, neural networks still have wide applications in engineering fields such as control.<br />
<br />
===Business Applications of Neural Networks===<br />
<br />
Neural networks are increasingly being used in real-world business applications and, in some cases, such as fraud detection, they have already become the method of choice. Their use for risk assessment is also growing and they have been employed to visualize complex databases for marketing segmentation. This method covers a wide range of business interests — from finance management, through forecasting, to production. The combination of statistical, neural and fuzzy methods now enables direct quantitative studies to be carried out without the need for rocket-science expertise.<br />
<br />
* On the Use of Neural Networks for Analysis Travel Preference Data <br />
* Extracting Rules Concerning Market Segmentation from Artificial Neural Networks <br />
* Characterization and Segmenting the Business-to-Consumer E-Commerce Market Using Neural Networks<br />
* A Neurofuzzy Model for Predicting Business Bankruptcy <br />
* Neural Networks for Analysis of Financial Statements <br />
* Developments in Accurate Consumer Risk Assessment Technology <br />
* Strategies for Exploiting Neural Networks in Retail Finance <br />
* Novel Techniques for Profiling and Fraud Detection in Mobile Telecommunications<br />
* Detecting Payment Card Fraud with Neural Networks<br />
* Money Laundering Detection with a Neural-Network <br />
* Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=7368stat841f102010-10-25T21:32:09Z<p>Hclam: /* Multi-Class Logistic Regression & Perceptron - October 19, 2010 */</p>
<hr />
<div>==[[Proposal Fall 2010]] ==<br />
==[[statf10841Scribe|Editor sign up]] ==<br />
{{Cleanup|date=October 8 2010|reason=Provide a summary for each topic here.}}<br />
== Summary ==<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
=== Principal Component Analysis ===<br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest of the [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analyses]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA takes a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data and projects the data onto these directions so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of the data whilst capturing as much of the variation in the data as possible.<br />
<br />
==[[f10_Stat841_digest |Digest ]] ==<br />
<br />
== ''' Reference Textbook''' ==<br />
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
== ''' Classification - September 21, 2010''' ==<br />
<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction] (feature extraction or manifold learning). Please note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.<br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found [http://www.e-knowledge.ca/quotes.php?topic=Knowledge here].<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers <br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which can take a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
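<br />
As a small illustration of the definition of <math>\,\hat{L}_{n}</math>, the following sketch counts the fraction of training inputs that a rule misclassifies. The threshold rule <code>h</code> and the four data points are hypothetical and serve only as an example.<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    mistakes = sum(1 for x, y in zip(X, Y) if h(x) != y)
    return mistakes / len(X)


def h(x):
    """Hypothetical classification rule: threshold a single feature at 0.5."""
    return 1 if x > 0.5 else 0


X = [0.1, 0.4, 0.6, 0.9]   # hypothetical one-dimensional feature values
Y = [0, 1, 1, 1]           # their true classes
print(empirical_error_rate(h, X, Y))
```

Here only the second input is misclassified, so one of the four indicator terms in the sum equals 1.<br />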
<br />
<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify any new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
<br />
In practice, the empirical error rate is used to estimate the true error rate, whose value cannot be known because the parameter values of the underlying process cannot be known but can only be estimated using available data. The empirical error rate can estimate the true error rate quite well: as mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], it is an unbiased estimator of the true error rate, provided that the rule <math>\,h</math> was not fit to the same data on which the error is measured. When it is computed on the training data themselves, it tends to underestimate the true error rate.<br />
<br />
=== Bayes Classifier ===<br />
<br />
{{Cleanup|date=October 14 2010|reason=In response to the previous tag: The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function <math>\,f_y(x)</math> in the equation:<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
The simpler form of the likelihood function seen in the naive Bayes is a product over the <math>\,d</math> features:<br />
:<math><br />
\begin{align}<br />
f_y(x) = P(X=x|Y=y) = {\prod_{j=1}^{d} P(X_{j}=x_{j}|Y=y)}<br />
\end{align}<br />
</math><br />
The Bayes classifier taught in class was not the naive Bayes classifier. Perhaps a comment should be made about the naive Bayes classifier in the body of the text}}<br />
<br />
<br />
A '''naive Bayes classifier''' is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". It is a special, simpler case of the general Bayes classifier that is described later in this section.<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for naive Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests][2].<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
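<br />
The independence assumption can be sketched as follows. The sketch assumes that each feature has a Gaussian class-conditional density, so that only a mean and a variance per feature are needed for each class; the Gaussian choice and the function names are assumptions made for illustration, not part of the lecture.<br />

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def naive_likelihood(x, means, variances):
    """Class-conditional likelihood under the naive assumption:
    the product of the per-feature densities, with no covariance terms."""
    return math.prod(gaussian_pdf(xi, m, v)
                     for xi, m, v in zip(x, means, variances))


# Hypothetical per-class parameters for two features (e.g. diameter, weight).
print(naive_likelihood([4.0, 150.0], means=[4.2, 160.0], variances=[0.25, 400.0]))
```

Only <math>\,2d</math> parameters per class are estimated here, instead of the <math>\,d(d+1)/2</math> entries of a full covariance matrix plus the <math>\,d</math> means.<br />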
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class <math> y \in \mathcal{Y} </math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h^*</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification for any given new data input, i.e., for any generic classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. Estimating all of these probability functions, such as the likelihood, the prior probability, and the evidence, is also computationally very expensive, which makes some other classifiers more favorable than the Bayes classifier.<br />
<br />
A detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h^*</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h^*)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> \,r </math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-written as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
'''Theorem'''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
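<br />
The theorem can be sketched in code as follows: compute <math>\,f_i(x)\pi_i</math> for every class, normalize by their sum to obtain the posteriors, and pick the class with the largest posterior. The likelihood and prior values below are hypothetical numbers chosen for illustration, and the function names are our own.<br />

```python
def posteriors(likelihoods, priors):
    """Posterior P(Y=y | X=x) for every class y, from f_y(x) and pi_y."""
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())          # P(X=x), the normalizing constant
    return {y: joint[y] / evidence for y in joint}


def bayes_classify(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Hypothetical values of f_y(x) at one input x, and priors pi_y, for k = 3 classes.
f = {1: 0.02, 2: 0.10, 3: 0.05}
pi = {1: 0.5, 2: 0.2, 3: 0.3}
print(posteriors(f, pi))
print(bayes_classify(f, pi))
```

Since the evidence is the same for every class, the arg max could equally be taken over the unnormalized products <math>\,f_i(x)\pi_i</math>.<br />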
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course. ''Note: these are the known y values in the training data.'' <br />
<br />
These known data are summarized in the following tables:<br />
<br />
:[[File:裁剪.jpg]]<br />
{{Cleanup|date=September 2010|reason=this graph is not complete, the reason is that it should be in consistent with the computation below.}}<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
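<br />
The same calculation can be sketched in a few lines of Python. This is only an illustration: the function name <code>posterior</code> is hypothetical, and the likelihood values 0.05 (for ''pass'') and 0.2 (for ''fail'') are read off the numerical expression in the computation above, since the data table itself is not reproduced here.<br />

```python
def posterior(likelihoods, priors):
    """Return P(Y=y | X=x) for every class y using Bayes' rule."""
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())  # P(X=x), the normalizing constant
    return {y: joint[y] / evidence for y in joint}

# Assumed inputs: P(X=(0,1,0) | Y=1) = 0.05 and P(X=(0,1,0) | Y=0) = 0.2,
# with equal priors of 0.5 as stated in the example.
likelihoods = {0: 0.2, 1: 0.05}
priors = {0: 0.5, 1: 0.5}

post = posterior(likelihoods, priors)
prediction = max(post, key=post.get)  # Bayes classifier: arg max of the posterior
print(post, prediction)
```

With these inputs the posterior probability of passing is below one half, so the sketch predicts class 0 (''fail'').<br />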
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
<br />
The Bayesian view of probability states that any event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how plausible the occurrence of E is before anything is known about other events that could affect it. Theoretically, this prior probability is a ''belief'' that represents the baseline probability of event E's occurrence. In practice, however, event E's prior probability is unknown, and it must therefore either be guessed or be estimated from a sample of available data. Given a guessed or estimated prior, the Bayesian view holds that the probability of event E, that is, the degree of belief in its occurrence, can be updated whenever new information about events relevant to E becomes available, and that the estimate becomes more accurate as more such information is taken into account. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work laying the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out a large, if not infinite, number of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
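<br />
The long-run relative frequency <math>\frac{n_x}{n_t}</math> can be illustrated with a small simulation. The sketch below is not part of the lecture; it assumes a hypothetical event with true probability 0.3 and uses Python's standard pseudo-random generator with a fixed seed, so the function name and all numbers are illustrative only.<br />

```python
import random

def relative_frequency(p_true, n_trials, seed=0):
    """Simulate n_trials independent trials and return n_x / n_t."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_x / n_trials

# The approximation n_x / n_t should settle near p_true as n_t grows.
for n_t in (10, 1000, 100000):
    print(n_t, relative_frequency(0.3, n_t))
```

With only a handful of trials the estimate can be far from 0.3, while with many trials it settles close to it.<br />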
<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, the set of points at which the two classes are equally probable. The class assigned to any new data input depends on which side of this boundary the input lies; in the linear case derived below, the boundary is a [http://en.wikipedia.org/wiki/Hyperplane hyperplane]. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the best-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and class-conditional densities are not known. Under LDA, one gets around this problem by assuming that both classes have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distributions] with the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the dimension of <math>\,x</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> to <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of <math>\,D(h^*)</math> is as follows:<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
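Under LDA, the Bayes classifier's decision boundary <math>\,D(h^*)</math> therefore has the form <math>\,a^\top x+b=0</math>, with <math>\,a = \Sigma^{-1}(\mu_1-\mu_0)</math> and <math>\,b = \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 \right)</math>, and it is linear in <math>\,x</math>. This is where the word ''linear'' in linear discriminant analysis comes from.<br />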
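<br />
As a rough numerical illustration of the linear boundary derived above, the following Python sketch computes the coefficient vector <math>\,\Sigma^{-1}(\mu_1-\mu_0)</math> and the constant term for a two-dimensional problem. The class means, covariance matrix and priors are made-up values, and the helper names (<code>inv2</code>, <code>matvec</code>, <code>lda_boundary</code>) are hypothetical.<br />

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_boundary(mu0, mu1, sigma, pi0, pi1):
    """Return (a, b) such that the decision boundary is a^T x + b = 0."""
    s_inv = inv2(sigma)
    a = matvec(s_inv, [mu1[0] - mu0[0], mu1[1] - mu0[1]])
    b = math.log(pi1 / pi0) - 0.5 * (
        dot(mu1, matvec(s_inv, mu1)) - dot(mu0, matvec(s_inv, mu0)))
    return a, b

# Made-up parameters: two classes sharing one covariance matrix, equal priors.
mu0, mu1 = [0.0, 0.0], [2.0, 2.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]
a, b = lda_boundary(mu0, mu1, sigma, 0.5, 0.5)

def classify(x):
    """Class 1 if a^T x + b > 0, otherwise class 0."""
    return 1 if dot(a, x) + b > 0 else 0

print(a, b, classify([2.0, 2.0]), classify([0.0, 0.0]))
```

Because the priors are equal in this sketch, the point halfway between the two means lies on the boundary.<br />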
<br />
<br />
LDA under the two-class case can easily be generalized to the case where there are <math>\,k \ge 2</math> classes. To find the Bayes classifier's decision boundary between any two classes <math>\,m </math> and <math>\,n</math>, we follow the derivation shown above with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. This gives the decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> as <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0</math>. In addition, if classes <math>\,m </math> and <math>\,n</math> have the same prior probability (for example, when the priors are estimated from classes with the same number of training points), the term <math>\,\log(\frac{\pi_m}{\pi_n})</math> vanishes and the decision boundary passes through the midpoint of the two class centers (means) <math>\,\mu_m</math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
According to [http://www.lsv.uni-saarland.de/Vorlesung/Digital_Signal_Processing/Summer06/dsp06_chap9.pdf this link], some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that the data in each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
<br />
The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-classes case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows: <br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math> (by cancellation)<br />
:<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)</math> (by taking the log of both sides)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0</math> (by expanding out)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0</math> <br />
<br />
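Under QDA, the decision boundary <math>\,D(h^*)</math> therefore has the form <math>\,x^\top Ax+b^\top x+c=0</math>, with <math>\,A = -\frac{1}{2}(\Sigma_1^{-1}-\Sigma_0^{-1})</math> and <math>\,b = \Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0</math>, and it is quadratic in <math>\,x</math>. This is where the word ''quadratic'' in quadratic discriminant analysis comes from.<br />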
<br />
As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0</math>.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian for every class <math>\,k</math>, then the Bayes classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
* In the case of LDA, which assumes that a common covariance matrix is shared by all classes, <math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is linear in <math>\,x</math>.<br />
<br />
* In the case of QDA, which assumes that each class has its own covariance matrix, <math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is quadratic in <math>\,x</math>.<br />
<br />
<br />
'''Note''': <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]<br />
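<br />
To make the QDA rule in the theorem concrete, here is a small Python sketch that evaluates <math>\,\delta_k(x)</math> for two-dimensional data and returns the arg max. The two classes, their parameters and the function names are invented for illustration; the closed-form 2x2 determinant and inverse are used only to keep the sketch free of external libraries.<br />

```python
import math

def qda_delta(x, mu, sigma, prior):
    """delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log(pi_k), 2-D case."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x-mu)^T Sigma^{-1} (x-mu) via the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * quad + math.log(prior)

# Invented parameters: class 1 is tight, class 2 is much more spread out.
params = {
    1: {"mu": (0.0, 0.0), "sigma": [[1.0, 0.0], [0.0, 1.0]], "prior": 0.5},
    2: {"mu": (3.0, 0.0), "sigma": [[4.0, 0.0], [0.0, 4.0]], "prior": 0.5},
}

def classify(x):
    """h*(x) = arg max_k delta_k(x)."""
    return max(params, key=lambda k: qda_delta(x, **params[k]))

print(classify((1.0, 0.0)), classify((6.0, 0.0)), classify((-4.0, 0.0)))
```

In this sketch the point <code>(-4, 0)</code> is assigned to class 2 even though it is closer to the center of class 1, because class 2 has the larger covariance; a single linear boundary could not produce this behaviour.<br />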
<br />
===In practice===<br />
In practice, the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we replace them with estimates computed from the sample (the maximum likelihood estimates), i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When a common covariance matrix is needed, as in LDA, it is estimated as a weighted average (pooled estimate) of the class covariance estimates:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma_r}}{n-k} </math><br />
<br />
where <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\hat{\Sigma_r}</math> is the estimated covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
<br />
See the details about the [http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices estimation of covariance matrices].<br />
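<br />
The estimates above can be computed directly from labelled data. The following Python sketch does so for two-dimensional points; the function name <code>estimate_parameters</code> and the toy data set are assumptions made for illustration, and the pooled covariance divides the summed scatter by <math>\,n-k</math> as in the formula above.<br />

```python
def estimate_parameters(xs, ys):
    """Estimate priors, class means, class covariances and the pooled covariance (2-D points)."""
    n = len(xs)
    classes = sorted(set(ys))
    priors, means, covs = {}, {}, {}
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for k in classes:
        pts = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(pts)
        priors[k] = n_k / n                                   # pi_k = n_k / n
        mu = [sum(p[j] for p in pts) / n_k for j in range(2)]
        means[k] = mu                                         # mu_k = class average
        # Scatter matrix: sum of outer products of the centered points (= n_k * Sigma_k).
        scatter = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in pts)
                    for j in range(2)] for i in range(2)]
        covs[k] = [[s / n_k for s in row] for row in scatter]
        for i in range(2):
            for j in range(2):
                pooled[i][j] += scatter[i][j]
    pooled = [[s / (n - len(classes)) for s in row] for row in pooled]
    return priors, means, covs, pooled

# Toy data: four points per class, arranged on the corners of a square.
xs = [(0, 0), (2, 0), (0, 2), (2, 2), (4, 4), (6, 4), (4, 6), (6, 6)]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
priors, means, covs, pooled = estimate_parameters(xs, ys)
print(priors, means, covs, pooled)
```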
<br />
===Computation For QDA And LDA===<br />
<br />
First, let us consider QDA, and examine each of the following two cases.<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
<math>\, \Sigma_k = I </math> for every class <math>\,k</math> implies that our data is spherical. This means that the data of each class <math>\,k</math> is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles (spheres in higher dimensions).<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,-\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, for each class we find the squared distance between the point and the class center, halve it, and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. When the priors are equal, the class whose center is nearest maximizes <math>\,\delta_k</math>; otherwise the log-prior term can shift the decision toward the more probable class. According to the theorem, we then assign the point to the class <math>\,k</math> with the largest <math>\,\delta_k</math>. <br />
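<br />
A minimal Python sketch of Case 1 follows, assuming identity covariance for every class. The centers, priors and the function name <code>classify_spherical</code> are invented for illustration.<br />

```python
import math

def classify_spherical(x, centers, priors):
    """With Sigma_k = I: delta_k(x) = -1/2 ||x - mu_k||^2 + log(pi_k); return the arg max."""
    def delta(k):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, centers[k]))
        return -0.5 * sq_dist + math.log(priors[k])
    return max(centers, key=delta)

centers = {1: (0.0, 0.0), 2: (2.0, 0.0)}   # invented class centers
x = (1.1, 0.0)                             # slightly closer to center 2

equal = classify_spherical(x, centers, {1: 0.5, 2: 0.5})
skewed = classify_spherical(x, centers, {1: 0.9, 2: 0.1})
print(equal, skewed)
```

With equal priors the point goes to the nearer center (class 2); with a strong prior for class 1 the log-prior adjustment outweighs the small difference in distance.<br />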
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top </math> (In general, when a matrix <math>\,A</math> is decomposed as <math>\,A=USV^\top</math>, the columns of <math>\,U</math> are eigenvectors of <math>\,AA^\top</math> and the columns of <math>\,V</math> are eigenvectors of <math>\,A^\top A</math>. <br />
Here <math>\, \Sigma_k </math> is symmetric and positive semi-definite, because it is the covariance matrix of the class data <math> X_k </math>, so we can take <math>\, U_k=V_k</math>.) The inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top </math> (since <math>\,U_k</math> is orthogonal, i.e. <math>\,U_k^{-1} = U_k^\top</math>)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^\top(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S_k^{-\frac{1}{2}}U_k^\top x </math> and <math>\, S_k^{-\frac{1}{2}}U_k^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>.<br />
<br />
A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math> where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math> and <math>\,\mu_k^*</math>, treating them as in Case 1 above.<br />
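<br />
The identity behind this transformation can be checked numerically. The Python sketch below, which is an illustration rather than the method used in class, decomposes a made-up symmetric 2x2 covariance matrix by a plane rotation, applies <math>\,S^{-\frac{1}{2}}U^\top</math> to a point and a center, and compares the resulting squared Euclidean distance with the quadratic form <math>\,(x-\mu)^\top\Sigma^{-1}(x-\mu)</math> computed directly. All function names and numbers are hypothetical.<br />

```python
import math

def eig_sym2(sigma):
    """Decompose a symmetric 2x2 matrix as U S U^T using a rotation by angle theta."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    lam1 = a * c * c + 2.0 * b * s * c + d * s * s
    lam2 = a * s * s - 2.0 * b * s * c + d * c * c
    u = [[c, -s], [s, c]]  # columns of u are the eigenvectors
    return u, (lam1, lam2)

def whiten(v, u, lams):
    """Return S^{-1/2} U^T v."""
    proj = [u[0][0] * v[0] + u[1][0] * v[1], u[0][1] * v[0] + u[1][1] * v[1]]
    return [proj[0] / math.sqrt(lams[0]), proj[1] / math.sqrt(lams[1])]

def quadratic_form(x, mu, sigma):
    """(x-mu)^T Sigma^{-1} (x-mu) via the closed-form 2x2 inverse."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

sigma = [[2.0, 0.5], [0.5, 1.0]]   # made-up covariance matrix
mu, x = [1.0, 1.0], [2.0, 0.0]     # made-up center and data point

u, lams = eig_sym2(sigma)
x_star, mu_star = whiten(x, u, lams), whiten(mu, u, lams)
whitened_sq = sum((p - q) ** 2 for p, q in zip(x_star, mu_star))
direct_sq = quadratic_form(x, mu, sigma)
print(whitened_sq, direct_sq)
```

The two printed numbers agree up to floating-point rounding, which is the content of the derivation above.<br />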
<br />
Note that, unlike in Case 1, the term <math>\, \log{|\Sigma_k|}</math> is in general different for each class, so it cannot be dropped. For each class <math>\,k</math> we must therefore also compute <math>\, \log{|\Sigma_k|}</math> (the sum of the logs of the diagonal entries of <math>\,S_k</math>) and include it, together with <math>\,\log(\pi_k)</math>, when computing <math> \,\delta_k </math> for QDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can a single transformation be used to map all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. A single transformation is available only when all classes share the same covariance matrix, which is the LDA setting. If the classes have different shapes and we transformed a new data point using only the transformation of class A, we would implicitly be assuming that the point belongs to class A.<br />
<br />
QDA can nevertheless be computed with this approach, as was done in assignment 1 with three different covariance matrices. Given a data point, apply the transformation of each class in turn to obtain <math>\,\delta_1, \delta_2, \dots</math>, and then assign the point to the class with the largest <math> \,\delta_k </math>.<br />
<br />
In summary, to apply QDA on a data set <math>\,X</math>, in the general case where <math>\, \Sigma_k \ne I </math> for each class <math>\,k</math>, one can proceed as follows:<br />
<br />
:: Step 1: For each class <math>\,k</math>, apply singular value decomposition to the estimated covariance matrix <math>\,\hat{\Sigma_k}</math> of the class data <math>\,X_k</math> to obtain <math>\,S_k</math> and <math>\,U_k</math>.<br />
<br />
:: Step 2: For each data point <math>\,x</math> and each class <math>\,k</math>, transform <math>\,x</math> to <math>\,x_k^* = S_k^{-\frac{1}{2}}U_k^\top x</math>, and transform the center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math> and each class <math>\,k</math>, compute <math>\,\delta_k(x) = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}\|x_k^* - \mu_k^*\|^2 + \log(\pi_k)</math>, where <math>\,\|x_k^* - \mu_k^*\|^2</math> is the squared Euclidean distance between the transformed data point and the transformed center, and assign <math>\,x</math> to the class with the largest <math>\,\delta_k(x)</math>.<br />
<br />
<br />
Now, let us consider LDA. <br />
Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common covariance matrix across the classes, so we perform the singular value decomposition once, as opposed to <math>\,k</math> times.<br />
<br />
To apply LDA on a data set <math>\,X</math>, one can proceed as follows:<br />
<br />
:: Step 1: Apply singular value decomposition to the estimated common covariance matrix <math>\,\hat{\Sigma}</math> to obtain <math>\,S</math> and <math>\,U</math>.<br />
<br />
:: Step 2: For each <math>\,x \in X</math>, transform <math>\,x</math> to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, and transform each class center <math>\,\mu_k</math> to <math>\,\mu_k^* = S^{-\frac{1}{2}}U^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu_k^*</math> of each class, and assign <math>\,x</math> to the class that minimizes <math>\,\frac{1}{2}\|x^* - \mu_k^*\|^2 - \log(\pi_k)</math>. When the priors are equal, this is simply the class with the nearest transformed center.<br />
<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often provides a better classifier than LDA when enough training data is available, because QDA does not assume that the covariance matrix of each class is identical, as LDA does. However, QDA still assumes that the class-conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method which does not have the Gaussian distribution assumption.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
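<br />
The two parameter counts can be compared for a few dimensions with a short Python sketch; the function names are hypothetical and the formulas are exactly the ones stated above.<br />

```python
def lda_param_count(K, d):
    """(K-1) boundaries, each of the form a^T x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_param_count(K, d):
    """(K-1) boundaries, each of the form x^T a x + b^T x + c with d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

for d in (2, 10, 64):
    print(d, lda_param_count(2, d), qda_param_count(2, d))
```

For two classes in 64 dimensions, this gives 65 parameters for LDA against 2145 for QDA.<br />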
<br />
== Trick: Using LDA to do QDA - September 28, 2010==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are non-linear functions of the original features, such as their squares. We then do LDA on the new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space corresponds to a quadratic boundary in the original space.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
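We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>v</math> is a <math>d \times d</math> matrix, that we cannot estimate directly with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math>, keeping only the squared terms of the quadratic part and writing <math>\,v_1,\dots,v_d</math> for their coefficients, such that:<br />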
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared element-wise, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension. Note, however, that this is not the same as doing QDA: applying QDA directly to the original data would in general give a different decision boundary. Here we look for a non-linear boundary that may separate the classes better than a linear one, and the method can be thought of as a non-linear LDA.<br />
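<br />
The construction of <math>\,x^*</math> and the evaluation of <math>\,g^*</math> can be sketched in Python as follows. The weights in <code>w_star</code> are arbitrary illustrative numbers, not estimates from any data set, and the function names are hypothetical.<br />

```python
def augment(x):
    """Map x = [x_1, ..., x_d] to x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]

def g_star(w_star, x):
    """Evaluate g*(x, x^2) = w*^T x*, which is linear in the augmented features."""
    return sum(wi * xi for wi, xi in zip(w_star, augment(x)))

w_star = [1.0, -2.0, 0.5, 3.0]   # arbitrary [w_1, w_2, v_1, v_2]
x = [2.0, -1.0]

x_aug = augment(x)
value = g_star(w_star, x)
print(x_aug, value)
```

Although <code>g_star</code> is a plain dot product in the four augmented features, it is a quadratic function of the original two features.<br />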
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we widen <code>X_star</code> to six columns and set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
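The feature-augmentation step used in this example can also be written without loops. The following is a minimal sketch on a small made-up matrix (the variable <code>sample</code> here is a hypothetical stand-in for the 400-by-2 PCA scores, not the 2_3 data):<br />

```matlab
% Hypothetical stand-in for the 400-by-2 PCA scores used in the example.
sample = [1 2; -3 4; 0.5 -1];

% Loop version, mirroring the example: copy the data, then append squares.
n = size(sample, 1);
X_loop = zeros(n, 4);
X_loop(:, 1:2) = sample;
for i = 1:n
    for j = 1:2
        X_loop(i, j+2) = X_loop(i, j)^2;
    end
end

% Vectorized version: append the element-wise squares as two new columns.
X_star = [sample, sample.^2];

% Both constructions should give the same augmented matrix.
same = isequal(X_loop, X_star);
disp(same);
```

The same pattern extends to other transformations, for example <code>[sample, sample.^2, sample.^4]</code>, provided the classifier is then given the widened matrix.<br />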
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA to the same data set and reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the classes of the points 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
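:The error-rate calculation can be sketched as follows. The two label vectors are made up for illustration; in the example they would be <code>group</code> and the output <code>class</code> of <code>classify</code>.<br />

```matlab
% Hypothetical true labels and predicted labels (stand-ins for group and class).
group = [1; 1; 1; 2; 2; 2; 2; 1];
class = [1; 2; 1; 2; 2; 1; 2; 1];

n_correct  = sum(class == group);              % number of matching labels
error_rate = 1 - n_correct / numel(group);     % empirical (training) error rate

fprintf('correct: %d of %d, error rate: %.4f\n', n_correct, numel(group), error_rate);
% With the counts reported in the text, the same formula gives 1 - 369/400 = 0.0775.
```

:Because the classifier is evaluated on the same points it was trained on, this is a training error and is likely to be optimistic for new data.<br />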
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA that do not lie on the correct side of the line produced by LDA.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform principal component analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] that performs PCA conveniently; its details can be found in the Matlab help file. Here we examine the code of <code>princomp()</code> to see how it differs from the SVD method. The code of <code>princomp</code> follows, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in<br />
% SCORE, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the numbers of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
Comparing the above code with the SVD method, we should note the following points:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. This is why we take the transpose of <math>\,X</math> when using princomp on the 2_3 data in Assignment 1.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using the SVD method and using princomp, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U, possibly up to a sign flip of individual columns.<br />
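The agreement between the two routes can be checked on synthetic data without the 2_3 file. The sketch below assumes a small random matrix in place of <code>X</code> and reproduces the core of the <code>princomp</code> listing by hand (an economy-size SVD of the scaled, centred data) instead of calling <code>princomp</code> itself:<br />

```matlab
% Synthetic stand-in for the data: D = 5 features, n = 40 observations as columns.
D = 5; n = 40;
X = randn(D, n) .* repmat((D:-1:1)', 1, n);    % unequal variances per feature

% Route 1: the SVD method from the text (centre, then SVD of the n-by-D matrix).
X1 = X - repmat(mean(X, 2), 1, n);
[s, d, v] = svd(X1');
y = X1' * v;

% Route 2: the core steps of the princomp listing, written out by hand.
centerx = X1';
[U2, latent, pc] = svd(centerx ./ sqrt(n - 1), 0);
score = centerx * pc;

% Each column may differ by a factor of -1, so align signs before comparing.
flip = sign(sum(v .* pc, 1));
max_diff_coeff = max(max(abs(v .* repmat(flip, D, 1) - pc)));
max_diff_score = max(max(abs(y .* repmat(flip, n, 1) - score)));
fprintf('max coefficient difference: %g\n', max_diff_coeff);
fprintf('max score difference: %g\n', max_diff_score);
```

Both differences should be at the level of round-off error, since scaling the centred data by a constant changes the singular values but not the singular vectors.<br />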
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru Niculescu-Mizil. ''An empirical comparison of supervised learning algorithms''. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Principal Component Analysis - September 30, 2010==<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br />
<br /><br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis] method. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA works by using a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data to project the data onto these directions so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation of our data is usually much easier to visualize and it also exhibits the most informative aspects (dimensions) of our data whilst capturing as much of the variation exhibited by our data as it possibly could. <br />
<br />
<br />
Furthermore, if one considers the lower dimensional representation produced by PCA as a least squares fit of our original data, then it can also be easily shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know which dimensions PCA selects to be the most informative dimensions in order to create the lower-dimensional representation. <br />
<br />
<br />
Suppose <math>\,X</math> is our data matrix containing <math>\,d</math>-dimensional data. The idea behind PCA is to apply [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] to <math>\,X</math> to replace the rows of <math>\,X</math> by a subset of it that captures as much of the [http://en.wikipedia.org/wiki/Variance variance] in <math>\,X</math> as possible. First, through the application of singular value decomposition to <math>\,X</math>, PCA obtains all of our data's directions of variation. These directions would also be ordered from left to right, with the leftmost directions capturing the most amount of variation in our data and the rightmost directions capturing the least amount. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space. <br />
<br />
<br />
By applying singular value decomposition to <math>\,X</math>, <math>\,X</math> is decomposed as <math>\,X = U\Sigma V^T \,</math>. The <math>\,d</math> columns of <math>\,U</math> are the [http://en.wikipedia.org/wiki/Eigenvector eigenvectors] of <math>\,XX^T \,</math>.<br />
The columns of <math>\,V</math> are the eigenvectors of <math>\,X^TX \,</math>. The nonzero diagonal values of <math>\,\Sigma</math> are the square roots of the nonzero [http://en.wikipedia.org/wiki/Eigenvalue eigenvalues] of <math>\,XX^T \,</math> (equivalently, of <math>\,X^TX \,</math>), and they correspond to the columns of <math>\,U</math> (also of <math>\,V</math>). <br />
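These relationships can be verified numerically. The sketch below uses a small matrix with arbitrary entries chosen only for illustration:<br />

```matlab
% Small d-by-n data matrix with arbitrary entries (d = 3, n = 4).
X = [2 0 1 3; 1 1 0 2; 0 4 1 1];

[U, S, V] = svd(X);
sing_vals = diag(S);                        % singular values, largest first

% Eigenvalues of X*X' sorted in decreasing order.
eig_vals = sort(eig(X * X'), 'descend');

% Squared singular values next to the eigenvalues of X*X'.
disp([sing_vals.^2, eig_vals]);

% Each column u of U should satisfy (X*X') u = sigma^2 u.
residual = norm(X * X' * U - U * diag(sing_vals.^2));
fprintf('eigenvector residual: %g\n', residual);
```

The two displayed columns should agree and the residual should be close to zero, up to round-off.<br />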
<br />
<br />
We are interested in <math>\,U</math>, whose <math>\,d</math> columns are the <math>\,d</math> directions of variation of our data. Ordered from left to right, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most informative direction of variation of our data. That is, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most effective column in terms of capturing the total variance exhibited by our data. A subset of the columns of <math>\,U</math> is used by PCA to reduce the dimensionality of <math>\,X</math> by projecting <math>\,X</math> onto the columns of this subset. In practice, when we apply PCA to <math>\,X</math> to reduce the dimensionality of <math>\,X</math> from <math>\,d</math> to <math>\,k</math>, where <math>k < d\,</math>, we would proceed as follows:<br />
<br />
:: Step 1: Center <math>\,X</math> so that it would have zero mean.<br />
<br />
:: Step 2: Apply singular value decomposition to <math>\,X</math> to obtain <math>\,U</math>.<br />
<br />
:: Step 3: Suppose we denote the resulting <math>\,k</math>-dimensional representation of <math>\,X</math> by <math>\,Y</math>. Then, <math>\,Y</math> is obtained as <math>\,Y = U_k^TX</math>. Here, <math>\,U_k</math> consists of the first (leftmost) <math>\,k</math> columns of <math>\,U</math> that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.<br />
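The three steps can be sketched in Matlab as follows. The data are synthetic (random points with made-up scales and offsets), and the variable names are chosen for this illustration only:<br />

```matlab
% Synthetic data: n = 50 points in d = 3 dimensions, stored as columns.
d = 3; n = 50; k = 2;
X = [3 0 0; 0 1 0; 0 0 0.1] * randn(d, n) + repmat([5; -2; 1], 1, n);

% Step 1: centre the data so that every row (variable) has zero mean.
Xc = X - repmat(mean(X, 2), 1, n);

% Step 2: singular value decomposition; columns of U are the directions of variation.
[U, S, V] = svd(Xc);

% Step 3: project the centred data onto the first k columns of U.
Uk = U(:, 1:k);
Y  = Uk' * Xc;                     % k-by-n lower-dimensional representation

% Fraction of the total variance retained by the first k directions.
sv = diag(S);
retained = sum(sv(1:k).^2) / sum(sv.^2);
fprintf('Y is %d-by-%d, variance retained: %.3f\n', size(Y, 1), size(Y, 2), retained);
```

The retained fraction is one common way to judge whether <math>\,k</math> is large enough; it is not part of the algorithm itself.<br />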
<br />
<br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' principal components. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), and so on.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:Handwritten 3s.gif]]<br />
<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,130}</math>. If we expand the matrix product then each image is approximated by the mean image plus a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[Image:23plotPCA.jpg]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
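The two-coefficient approximation <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math> described in this section can be sketched on synthetic data. The sizes below (6 pixels, 10 images) are made up so that the example runs quickly; they are not the sizes of the digit data.<br />

```matlab
% Synthetic "images": p = 6 pixels each, n = 10 images stored as columns.
p = 6; n = 10;
X = randn(p, n);

x_bar = mean(X, 2);                         % mean image
Xc = X - repmat(x_bar, 1, n);
[U, S, R] = svd(Xc, 0);

V = U(:, 1:2);                              % p-by-2 basis (the matrix V in the text)
W = V' * Xc;                                % 2-by-n coefficients (the matrix W in the text)

% Approximate the first image from its two coefficients.
lambda = W(:, 1);
x_hat  = x_bar + lambda(1) * V(:, 1) + lambda(2) * V(:, 2);

% The same reconstruction written as a matrix product for all images at once.
X_hat = repmat(x_bar, 1, n) + V * W;
fprintf('difference between the two forms: %g\n', norm(x_hat - X_hat(:, 1)));
```

Each image is thus stored as one column of <code>W</code> (two numbers) plus the shared quantities <code>x_bar</code> and <code>V</code>.<br />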
<br />
===Derivation of the first Principal Component===<br />
{{Cleanup|date=October 2010|reason=I think English of this section must be improved}}<br />
We want to find the direction of maximum variation. Let <math>\begin{align}\textbf{w}\end{align}</math> be an arbitrary direction, <math>\begin{align}\textbf{x}\end{align}</math> a data point and <math>\begin{align}\displaystyle u\end{align}</math> the length of the projection of <math>\begin{align}\textbf{x}\end{align}</math> in direction <math>\begin{align}\textbf{w}\end{align}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\begin{align}\textbf{w}\end{align}</math> is the same as <math>\begin{align}c\textbf{w}\end{align}</math>, for any scalar <math>c</math>, so without loss of generality, we assume that: <br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}.<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}. <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} .</math><br />
<br /><br /><br />
The above is the variance of <math>\begin{align}\displaystyle u \end{align}</math> formed by the weight vector <math>\begin{align}\textbf{w} \end{align}</math>. The first principal component is the vector <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the function. Without a constraint this problem has no solution: the objective is a convex quadratic form in <math>\begin{align}\textbf{w} \end{align}</math>, so it grows without bound as the length of <math>\begin{align}\textbf{w} \end{align}</math> increases. Since we are only interested in the direction of the variability, we restrict the length of <math>\begin{align}\textbf{w} \end{align}</math> to 1, and the problem becomes<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
s.t. <math>\textbf{w}^T \textbf{w} = 1.</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|.<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
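The identity used in this derivation, that the sample variance of the projection <math>u = \textbf{w}^T \textbf{x}</math> equals the quadratic form of the sample covariance matrix, can be checked numerically. The sketch below uses synthetic data and an arbitrary direction; the sample covariance matrix is called <code>S</code> in the code.<br />

```matlab
% Synthetic sample: n = 200 observations (rows) of a D = 3 dimensional vector.
n = 200; D = 3;
data = randn(n, D) * [2 0 0; 1 1 0; 0 0.5 0.3];

w = [1; 2; -1];
w = w / norm(w);                 % unit-length direction

u = data * w;                    % projection of every observation onto w
S = cov(data);                   % D-by-D sample covariance matrix

var_u     = var(u);              % sample variance of the projections
quad_form = w' * S * w;          % quadratic form from the derivation
fprintf('var(u)  = %.6f\n', var_u);
fprintf('w''*S*w = %.6f\n', quad_form);
```

Both <code>var</code> and <code>cov</code> normalize by <math>n-1</math> by default, so the two numbers should agree up to round-off.<br />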
<br />
====Lagrange Multiplier====<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one gives the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
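As a numerical sanity check of this example (not part of the Lagrange method itself), we can parametrise the unit circle by an angle and search for the largest value of <math>\displaystyle f</math> on a fine grid:<br />

```matlab
% Parametrise the constraint x^2 + y^2 = 1 by an angle and evaluate f = x - y.
theta = linspace(-pi, pi, 100001);
x = cos(theta);
y = sin(theta);
f = x - y;

[f_max, idx] = max(f);
fprintf('maximum f = %.4f at (x, y) = (%.4f, %.4f)\n', f_max, x(idx), y(idx));

% Values of f at the two stationary points found with the Lagrangian.
f_a = sqrt(2)/2 - (-sqrt(2)/2);    % value at ( sqrt(2)/2, -sqrt(2)/2)
f_b = -sqrt(2)/2 - sqrt(2)/2;      % value at (-sqrt(2)/2,  sqrt(2)/2)
fprintf('f at the stationary points: %.4f and %.4f\n', f_a, f_b);
```

The grid search should land at approximately <math>\displaystyle (0.7071, -0.7071)</math> with <math>\displaystyle f \approx 1.4142</math>, matching the stationary point identified as the maximum; the other stationary point is the minimum.<br />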
<br />
====Determining '''W''' ====<br />
Back to the original problem, we write <math>S</math> for the sample covariance matrix (denoted <math>s</math> above) and form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w}</math>.<br />
<br />
Setting the partial derivative of <math>\displaystyle L</math> with respect to <math>\displaystyle \lambda</math> to zero recovers exactly this constraint, <math> \textbf{w}^T \textbf{w} =1 </math>, so the constrained maximum is found among the stationary points of <math>\displaystyle L</math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
{{Cleanup|date=October 2010|reason=It is good discussion, what will happen if we don't have distinct eigenvalues and eigenvectors? What does this situation mean? }}<br />
{{Cleanup|date=October 2010|reason=If the eigenvalues are not distinct, I suppose we could still take the leftmost eigenvector by default. Not sure if this is the correct approach, so can anyone please explain further? Thanks }}<br />
{{Cleanup|date=October 2010|reason= As U is the eigenvector of a symetric matrix, is it possible that we have 2 similar eigen vector? }}<br />
<br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
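The claim that the eigenvector of the largest eigenvalue maximizes the variance over all unit directions can be illustrated numerically. The covariance matrix below is an arbitrary symmetric positive definite example, and the random search is only a sanity check, not a proof:<br />

```matlab
% Example covariance matrix (symmetric positive definite; values chosen for illustration).
S = [4 1 0; 1 3 1; 0 1 2];

[E, L] = eig(S);
[lambda_max, i_max] = max(diag(L));
w_star = E(:, i_max);                 % unit eigenvector for the largest eigenvalue

% The variance along w_star equals the largest eigenvalue.
var_star = w_star' * S * w_star;
fprintf('lambda_max = %.4f, variance along w_star = %.4f\n', lambda_max, var_star);

% No other unit direction should do better; try many random ones.
best_random = -Inf;
for t = 1:1000
    w = randn(3, 1);
    w = w / norm(w);
    best_random = max(best_random, w' * S * w);
end
fprintf('best of 1000 random directions = %.4f\n', best_random);
```

The best random direction should come close to, but never exceed, the largest eigenvalue.<br />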
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
<br /><br /><br />
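This decomposition of the total variance can be checked on synthetic data (random observations with made-up scales):<br />

```matlab
% Synthetic sample: n = 100 observations (rows) of D = 4 variables.
data = randn(100, 4) * [1 0 0 0; 0.5 2 0 0; 0 1 0.5 0; 0 0 0 3];

S = cov(data);                      % sample covariance matrix
S = (S + S') / 2;                   % guard against round-off asymmetry
lambda = sort(eig(S), 'descend');   % variances along the principal components

total_from_eigs = sum(lambda);
total_from_vars = sum(var(data));   % sum of the variances of the original variables
fprintf('sum of eigenvalues: %.6f\n', total_from_eigs);
fprintf('trace(S):           %.6f\n', trace(S));
fprintf('sum of variances:   %.6f\n', total_from_vars);
```

All three numbers should agree up to round-off, since the diagonal of <code>S</code> holds the variances of the original variables.<br />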
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy % loads the data matrix X (one image per column)<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)') % display the first noisy image<br />
colormap gray<br />
m_X=mean(X,2); % mean image<br />
mm=repmat(m_X,1,300);<br />
XX=X-mm; % centre the data<br />
[u s v] = svd(XX);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
xHat=xHat+mm; % add the mean back<br />
figure<br />
imagesc(reshape(xHat(:,1),20,28)') % reconstruction of the first image; the index can be any column from 1 to 300<br />
colormap gray<br />
<br />
Running the above code gives us 2 images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the denoised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
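The effect of the number of components can be explored on synthetic data. The sketch below assumes a made-up rank-3 signal with added noise rather than the face images, and measures how closely the reconstruction matches the noisy input as more components are kept:<br />

```matlab
% Synthetic data: a rank-3 signal in 20 dimensions plus small noise (sizes are made up).
D = 20; n = 100; r = 3;
signal = randn(D, r) * randn(r, n);
noisy  = signal + 0.1 * randn(D, n);

m  = mean(noisy, 2);
XX = noisy - repmat(m, 1, n);
[u, s, v] = svd(XX, 0);

% Reconstruct with k components and measure the distance to the noisy input.
ks  = [1 2 3 5 10 20];
err = zeros(size(ks));
for t = 1:numel(ks)
    k = ks(t);
    xHat = u(:, 1:k) * s(1:k, 1:k) * v(:, 1:k)' + repmat(m, 1, n);
    err(t) = norm(noisy - xHat, 'fro');
end
disp([ks; err]);
```

The distance to the noisy input shrinks as <code>k</code> grows and reaches zero (up to round-off) when all components are used, which is the numerical counterpart of recovering the noisy image. Denoising relies on stopping earlier, near the intrinsic dimensionality.<br />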
<br />
====Application of PCA - Feature Extraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized as follows (taken from the Lecture Slides).<br />
<br />
====Algorithm ====<br />
'''Recover basis:''' Calculate <math> XX^T =\Sigma_{i=1}^{n} x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.<br />
<br />
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times n</math> matrix of encoding of the original data.<br />
<br />
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.<br />
<br />
'''Encode test example:''' <math> y=U^T x </math> where <math> y </math> is a <math>d</math>-dimensional encoding of <math>x</math>.<br />
<br />
'''Reconstruct test example:''' <math>\hat{x}= Uy=UU^Tx </math>.<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.<br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
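<br />
The steps of the algorithm above can be sketched in a few lines of Python. This is only a minimal illustration, assuming 2-dimensional data reduced to <math>d=1</math>, so that the top eigenvector of the 2×2 matrix <math>XX^T</math> can be written in closed form; the function names <code>top_eigenvector_2x2</code> and <code>pca_1d</code> and the toy data are hypothetical and are not taken from the lecture.<br />

```python
import math

def top_eigenvector_2x2(a, b, c):
    """Unit eigenvector for the largest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = lam - c, b
    norm = math.hypot(vx, vy)
    if norm < 1e-12:  # degenerate case: b == 0 and a <= c
        return (0.0, 1.0)
    return (vx / norm, vy / norm)

def pca_1d(points):
    """Encode 2-D points with one principal component and reconstruct them."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Note 1 above: the data must have zero mean, so subtract the mean first.
    centered = [(x - mx, y - my) for x, y in points]
    # Recover basis: entries of X X^T = sum_i x_i x_i^T.
    a = sum(x * x for x, _ in centered)
    b = sum(x * y for x, y in centered)
    c = sum(y * y for _, y in centered)
    u = top_eigenvector_2x2(a, b, c)
    # Encoding: y = U^T x (a scalar per point, since d = 1).
    codes = [u[0] * x + u[1] * y for x, y in centered]
    # Reconstruction: x_hat = U y, with the mean added back.
    recon = [(mx + u[0] * t, my + u[1] * t) for t in codes]
    return u, codes, recon

# Toy data lying exactly on the line y = 2x (an assumption made for illustration).
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
u, codes, recon = pca_1d(points)
print(u, recon)
```

Because the toy points lie exactly on a line, a single component reconstructs them without error; with noisy data the reconstruction would instead drop the low-variance direction.<br />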
<br />
== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010 ==<br />
<br />
===Sir Ronald A. Fisher===<br />
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here]. <br />
In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
As in PCA, the goal of FDA is to project the data onto a lower-dimensional space. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf here]. PCA is an unsupervised method for dimensionality reduction, so it does not take into account the labels in the data. Suppose we have two clusters that have different labels but are elongated in the same direction, lying parallel to each other and very near to each other. In this case, most of the total variation of the data is along the common direction of these two clusters. If we use PCA in cases like this, then both clusters would be projected onto the direction of greatest variation of the data and would look like a single cluster after projection. PCA would therefore mix up these two clusters that, in fact, have very different labels. What we need to do instead, in cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation of the data, i.e. the direction of least variation. On the 1-dimensional space resulting from such a projection, we would then be able to effectively classify the data, because the two clusters would be perfectly or nearly perfectly separated from each other once their labels are taken into account. This is exactly the idea behind FDA. <br />
<br />
The main difference between FDA and PCA is that, in FDA, in contrast to PCA, we are not interested in retaining as much of the variance of our original data as possible. Rather, in FDA, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction that is most representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
Suppose we have 2-dimensional data, then FDA would attempt to project the data of each class onto a point in such a way that the resulting two points would be as far apart from each other as possible. Intuitively, this basic idea behind FDA is the optimal way for separating each pair of classes along a certain direction. <br />
<br />
{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).}}<br />
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}<br />
{{Cleanup|date=October 2010|reason=An extension of clustering is subspace clustering, in which different subspaces are searched to find the relevant and appropriate dimensions. Points in high-dimensional data sets are roughly equidistant from each other, so feature selection methods are used to remove the irrelevant dimensions. These techniques do not keep the relative distances, so PCA is not useful for these applications. It should be noted that subspace clustering algorithms localize their search, unlike feature selection algorithms. For more information click here[http://portal.acm.org/citation.cfm?id=1007731]}}<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a two-class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k-class problem, we want to reduce the data to k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
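<br />
The equivalence mentioned above (a Mahalanobis distance is a Euclidean distance after a linear transformation) can be checked numerically. The sketch below is in Python and assumes a metric of the form <math>A = L^TL</math> for a hypothetical 2×2 matrix <math>L</math>; the names <code>mahalanobis_sq</code>, <code>euclidean_sq</code> and <code>matvec</code> and all numbers are illustrative only.<br />

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance (x - y)^T A (x - y)."""
    d = [xi - yi for xi, yi in zip(x, y)]
    Ad = matvec(A, d)
    return sum(di * adi for di, adi in zip(d, Ad))

def euclidean_sq(x, y):
    """Squared Euclidean distance."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

# Hypothetical linear transformation L; the induced metric is A = L^T L.
L = [[2.0, 0.0], [1.0, 1.0]]
A = [[sum(L[k][i] * L[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

x, y = [1.0, 2.0], [3.0, -1.0]
lhs = mahalanobis_sq(x, y, A)                    # distance under the metric A
rhs = euclidean_sq(matvec(L, x), matvec(L, y))   # Euclidean distance after mapping by L
print(lhs, rhs)
```

Both quantities agree, which is why learning the metric <math>A</math> and learning the linear map <math>L</math> are two views of the same problem.<br />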
<br />
{{Cleanup|date=October2010|reason=Anyone please add an example to make the comparison clearer}}<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper,"[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA] " written by our instructor, they propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs better or as well as some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and the Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
===FDA Goals===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
==== Example in R ====<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
: Calculate the singular value decomposition of the mean-centred X (PCA requires data with zero mean). The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
<br />
FDA projects the data into lower dimensional space, where the distances between the projected means are maximum and the within-class variances are minimum. There are two categories of classification problems:<br />
<br />
1. Two-class problem<br />
<br />
2. Multi-class problem (addressed next lecture)<br />
<br />
=== Two-class problem ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking points of each class form a cloud around the mean of the class, with each class having possibly different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within each class''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two projected variances (the sum of the two covariance matrices is itself a valid covariance matrix, since it is symmetric and positive semi-definite). <br />
<br />
{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}<br />
<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \ldots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> where <math>\ z_i </math> is a scalar<br />
<br />
====1. Minimizing within-class variance==== <br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_1\underline{w}) </math><br />
<br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_2\underline{w}) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math><br />
<br> (where <math>\,\Sigma_1</math> and <math>\,\Sigma_2 </math> are the covariance matrices of the 1st and 2nd classes of data, respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>.<br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br /> <br />
<math>\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2, </math><br />
<br /><br /><br />
<math>\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1 - \mu_2)^T \underline{w} \underline{w}^T (\mu_1 - \mu_2) \\<br />
&= ((\mu_1 - \mu_2)^T \underline{w})^{T} (\underline{w}^T (\mu_1 - \mu_2))^{T} \\<br />
&= \underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \underline{w} \end{align}</math><br /><br />
<br />
Note that the last two steps hold because each of the two bracketed factors is a scalar and therefore equals its own transpose, which lets us reverse the order of the terms inside each factor.<br />
<br />
Let <math>\displaystyle s_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math>, the between-class covariance, then the goal is to <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>.<br />
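<br />
As a quick numerical illustration of the two quantities just defined, the following Python sketch evaluates <math>\underline{w}^Ts_w\underline{w}</math> and <math>\underline{w}^Ts_B\underline{w}</math> for one trial direction. It assumes the class means and the shared covariance used in the R example above; the helper <code>quad_form</code> and the trial direction <math>\underline{w}=(1,0)^T</math> are hypothetical choices made for illustration.<br />

```python
def quad_form(M, w):
    """Return w^T M w for a square matrix M stored as a list of rows."""
    n = len(w)
    return sum(w[i] * M[i][j] * w[j] for i in range(n) for j in range(n))

# Class means and covariances (assumed equal, as in the R example).
mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma1 = [[1.0, 1.5], [1.5, 3.0]]
sigma2 = [[1.0, 1.5], [1.5, 3.0]]

# Within-class covariance s_w = Sigma_1 + Sigma_2.
s_w = [[sigma1[i][j] + sigma2[i][j] for j in range(2)] for i in range(2)]

# Between-class covariance s_B = (mu1 - mu2)(mu1 - mu2)^T.
diff = [a - b for a, b in zip(mu1, mu2)]
s_B = [[di * dj for dj in diff] for di in diff]

w = [1.0, 0.0]                     # arbitrary trial direction
within = quad_form(s_w, w)         # projected within-class variance (goal 1: make small)
between = quad_form(s_B, w)        # squared gap between projected means (goal 2: make large)
projected_gap = sum(wi * di for wi, di in zip(w, diff))   # w^T mu1 - w^T mu2
print(within, between, projected_gap ** 2)
```

The last two printed values coincide, confirming that <math>\underline{w}^Ts_B\underline{w}</math> is exactly the squared distance between the projected means.<br />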
<br />
===The Objective Function for FDA===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math> or <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math><br />
# <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math> <br />
<br /><br /><br />
So, we construct our objective function as maximizing the ratio of the two goals above:<br /><br />
<br /><br />
<math>\displaystyle \max_w \frac{(\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w})} {(\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max_w \frac{(\underline{w}^Ts_B\underline{w})}{(\underline{w}^Ts_w\underline{w})}</math> <br /><br />
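One could instead combine the two goals by subtraction. This approach is valid, but it can be shown that it needs an additional scaling factor, so using the ratio is more convenient.<br />
<br />
The ratio does not change when <math>\underline{w}</math> is rescaled, so its maximizer is not unique; moreover, the numerator <math>\underline{w}^Ts_B\underline{w}</math> on its own is a convex quadratic that grows without bound as <math>\underline{w}</math> is scaled up, so it has no maximum. To get around this problem, we fix the scale of <math>\underline{w}</math> by adding the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> and maximize the numerator subject to it. To solve this constrained optimization problem, we form the Lagrangian:<br />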
<br />
<br /><br /><br />
<math>\displaystyle L(\underline{w},\lambda) = \underline{w}^Ts_B\underline{w} - \lambda (\underline{w}^Ts_w\underline{w} -1)</math><br /><br /><br />
<br />
<br /><br />
Then, we equate the partial derivative of L with respect to <math>\underline{w}</math> to zero:<br />
<math>\displaystyle \frac{\partial L}{\partial \underline{w}}=2s_B \underline{w} - 2\lambda s_w \underline{w} = 0 </math> <br /><br />
<br />
<math>s_B \underline{w} = \lambda s_w \underline{w}</math><br /><br />
<math>s_w^{-1}s_B \underline{w}= \lambda\underline{w}</math><br /><br /><br />
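This is in the form of a generalized eigenvalue problem. Multiplying <math>s_B \underline{w} = \lambda s_w \underline{w}</math> on the left by <math>\underline{w}^T</math> and using the constraint gives <math>\underline{w}^Ts_B\underline{w} = \lambda</math>, so the objective is maximized by taking <math> \underline{w}</math> to be the eigenvector of <math>s_w^{-1}s_B </math> that corresponds to the largest eigenvalue.<br /><br />
<br />
This solution can be further simplified as follows:<br /><br />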
<br />
<math>s_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w} = \lambda\underline{w} </math><br /><br />
<br />
Since <math>(\mu_1 - \mu_2)^T\underline{w}</math> is a scalar then <math>s_w^{-1}(\mu_1 - \mu_2)</math>∝<math>\underline{w}</math> <br /><br /><br />
This gives the direction of <math>\underline{w}</math> without doing eigenvalue decomposition in the case of 2-class problem.<br />
<br />
Note: In order for <math>{s_w}</math> to have an inverse, it must have full rank. A necessary (though not sufficient) condition for this is that the number of data points is <math>\,\ge</math> the dimensionality of <math>\underline{x_{i}}</math>.<br />
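<br />
The closed-form direction <math>\underline{w} \propto s_w^{-1}(\mu_1 - \mu_2)</math> can be sketched in Python as follows. This is a minimal illustration, assuming the population means and covariance from the examples in these notes (not estimates from sampled data); <code>solve_2x2</code> and <code>fisher_ratio</code> are hypothetical helper names.<br />

```python
def solve_2x2(M, b):
    """Solve M x = b for a 2x2 matrix M using the explicit inverse."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0:
        raise ValueError("s_w is singular, so the FDA direction is not defined")
    return [(M[1][1] * b[0] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

def fisher_ratio(w, diff, s_w):
    """Objective (w^T s_B w) / (w^T s_w w) with s_B = diff diff^T."""
    num = (w[0] * diff[0] + w[1] * diff[1]) ** 2
    sw_w = [s_w[0][0] * w[0] + s_w[0][1] * w[1],
            s_w[1][0] * w[0] + s_w[1][1] * w[1]]
    den = w[0] * sw_w[0] + w[1] * sw_w[1]
    return num / den

mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]                   # shared class covariance
s_w = [[2 * v for v in row] for row in sigma]      # s_w = Sigma_1 + Sigma_2
diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]          # mu_1 - mu_2

w = solve_2x2(s_w, diff)                           # w = s_w^{-1} (mu_1 - mu_2)
print(w, fisher_ratio(w, diff, s_w))
print(fisher_ratio(diff, diff, s_w))               # ratio along the line joining the means
```

The ratio attained by <code>w</code> is larger than the ratio along the line joining the two means, which shows that simply connecting the class means is not the best discriminant direction when the classes are correlated. Only the direction of <code>w</code> matters; its sign and length are arbitrary.<br />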
<br />
===FDA Using Matlab===<br />
Note: ''The following example was not actually mentioned in this lecture''<br />
<br />
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data sets:<br />
% First data set X1<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);<br />
%In this case: <br />
mu_1=[1;1]; <br />
Sigma_1=[1 1.5; 1.5 3]; <br />
%where mu and sigma are the mean and covariance matrix.<br />
% Second data set X2<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300); <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
plot(X1(:,1),X1(:,2),'.b'); hold on;<br />
plot(X2(:,1),X2(:,2),'ob')<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
% Combine data sets to map both into the same subspace<br />
X=[X1;X2];<br />
X=X';<br />
% We use the built-in PCA function in Matlab, which expects one observation per row, so we pass X' (600 x 2)<br />
[coefs, scores]=princomp(X');<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %The projected data are one-dimensional, so we plot them along a horizontal line at height 1<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
%We see in the above picture that there is very little overlap between the two classes<br />
Xp=coefs(:,1)'*X<br />
figure<br />
plot(Xp(1:300),1,'ob') %A marker is required here, otherwise the discrete points are not drawn<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
%In this case the two classes overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
===Some of FDA applications===<br />
FDA has applications in many domains; some of them are listed below:<br />
<br />
* SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS<br />
FDA can be used to enhance listening comprehension when the user moves from one sound<br />
environment to a different one. For more information, see the paper by Alexandre et al. [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here].<br />
<br />
* Application to Face Recognition<br />
FDA can be used for face recognition in different situations. Kong et al. propose an application of FDA to face<br />
recognition with a small number of training samples [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].<br />
<br />
* Palmprint Recognition<br />
FDA is used in biometrics, to implement an automated palmprint recognition system. See An Automated Palmprint Recognition System by Tee et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here].<br />
<br />
{{Cleanup|date=October 2010|reason=I think briefing about the other applications would be easier than browsing through all of these applications}}<br />
<br />
{{Cleanup|date=October 2010|reason= This link is no longer valid.}}<br />
<br />
Other applications can be found in references 4, 5, 6, 7 and 8, and more [http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=1489148820&_sort=r&_st=13&view=c&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=f210273546a659c90ae0962fce7b8b4e&searchtype=a here].<br />
<br />
=== '''References'''===<br />
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; , "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , vol.2, no., pp. 1083- 1088 vol. 2, 20-25 June 2005<br />
doi: 10.1109/CVPR.2005.30<br />
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]<br />
<br />
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS USING A TWO-LAYER CLASSIFICATION SYSTEM WITH MSE LINEAR DISCRIMINANTS", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]<br />
<br />
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]<br />
<br />
4. Guimet, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.<br />
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]<br />
<br />
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"<br />
Journal of Computers & Chemical Engineering, 2004<br />
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]<br />
<br />
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004<br />
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]<br />
<br />
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]<br />
<br />
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal of Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]<br />
<br />
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010==<br />
<br />
====Obtaining Covariance Matrices====<br />
<br />
<br />
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{i}<br />
\end{align}<br />
</math><br />
<br />
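where <math>\mathbf{S}_{i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter of class <math>i</math> about its mean <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. No <math>\frac{1}{n_{i}}</math> factor is applied, so that the decomposition of the total covariance derived below holds exactly.<br />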
<br />
However, the between-class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total covariance <math>\,\mathbf{S}_{T}</math> of a given set of data is constant and known, and we can also decompose this variance into two parts: the within-class variance <math>\mathbf{S}_{W}</math> and the between-class variance <math>\mathbf{S}_{B}</math> in a way that is similar to [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA]. We thus have:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
where the total covariance (the scatter about the overall mean <math>\mathbf{\mu}</math>, likewise without normalization) is given by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = <br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
We can now get <math>\mathbf{S}_{B}</math> from the relationship: <br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
<br />
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Suppose the data contains <math>\, k </math> classes, and each class <math>\, j </math> contains <math>\, n_{j} </math> data points. We denote the overall mean vector by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
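Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Adding and subtracting the class mean <math>\mathbf{\mu}_{i}</math> inside each factor, and noting that the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = 0</math> for every class, we obtain<br />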
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within-class covariance <math>\mathbf{S}_{W}</math><br />
and the between-class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term in the final line of the derivation above as the between-class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
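<br />
The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. The Python sketch below uses the unnormalized sums that appear in the derivation above; the five data points, the two class labels, and the helper names (<code>mean</code>, <code>outer</code>, <code>add</code>, <code>scatter</code>) are made up for illustration.<br />

```python
def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(A, B):
    """Element-wise sum of two matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scatter(points, center):
    """Sum over points of (x - center)(x - center)^T, with no normalization."""
    dim = len(center)
    S = [[0.0] * dim for _ in range(dim)]
    for p in points:
        d = [pi - ci for pi, ci in zip(p, center)]
        S = add(S, outer(d, d))
    return S

# Hypothetical labeled data: class label -> list of points.
classes = {1: [(0.0, 0.0), (2.0, 0.0)],
           2: [(4.0, 3.0), (6.0, 5.0), (5.0, 4.0)]}

all_points = [p for pts in classes.values() for p in pts]
mu = mean(all_points)
S_T = scatter(all_points, mu)                      # total scatter

S_W = [[0.0, 0.0], [0.0, 0.0]]
S_B = [[0.0, 0.0], [0.0, 0.0]]
for pts in classes.values():
    mu_i = mean(pts)
    S_W = add(S_W, scatter(pts, mu_i))             # within-class part
    d = [a - b for a, b in zip(mu_i, mu)]
    S_B = add(S_B, [[len(pts) * v for v in row] for row in outer(d, d)])  # n_i (mu_i - mu)(mu_i - mu)^T

print(S_T)
print(add(S_W, S_B))
```

The two printed matrices agree up to floating-point rounding, so <math>\mathbf{S}_{B}</math> may be computed either directly or as <math>\mathbf{S}_{T} - \mathbf{S}_{W}</math>.<br />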
<br />
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^* =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})^{T})<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
whereas the general formula with <math>k=2</math> gives<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
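The two expressions are in fact proportional. Since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, so <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B}^*</math> and both lead to the same projection direction.<br />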
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\quad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))((\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W})<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the following as our measure:<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of<br />
<br />
{{Cleanup|date=What if we encounter complex eigenvalues? Then the concept of being large does not make sense. What is the solution in that case? }}<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
<br />
Recall that the squared Frobenius norm of <math>\mathbf{X}</math> is<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following classic criterion function that Fisher used<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
As in the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, so setting the first derivative to zero gives:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero diagonal entries, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
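<br />
The eigenvalue characterization can be checked numerically on a tiny example. The sketch below uses made-up 2-D data for two classes and hand-written 2x2 helpers (all names and numbers are illustrative assumptions, not part of the lecture). It builds <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math> as defined above and verifies that <math>\mathbf{w}=\mathbf{S}_{W}^{-1}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> satisfies <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}=\lambda\mathbf{w}</math>.<br />

```python
# Illustrative 2-D data for two classes (made-up numbers).
class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
class2 = [(6.0, 5.0), (7.0, 8.0), (8.0, 8.0), (7.0, 7.0)]


def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def sub(u, v):
    """Difference u - v of two 2-vectors."""
    return [u[0] - v[0], u[1] - v[1]]


def outer(u, v):
    """Outer product u v^T as a 2x2 nested list."""
    return [[u[i] * v[j] for j in range(2)] for i in range(2)]


def mat_add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def mat_scale(c, a):
    return [[c * a[i][j] for j in range(2)] for i in range(2)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_vec(a, v):
    return [a[i][0] * v[0] + a[i][1] * v[1] for i in range(2)]


def inverse(a):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


n1, n2 = len(class1), len(class2)
n = n1 + n2
mu1, mu2 = mean(class1), mean(class2)
mu = mean(class1 + class2)              # overall mean

# Within-class scatter S_W: deviations of each point from its class mean.
S_W = [[0.0, 0.0], [0.0, 0.0]]
for points, m in ((class1, mu1), (class2, mu2)):
    for x in points:
        d = sub(x, m)
        S_W = mat_add(S_W, outer(d, d))

# Between-class scatter S_B: weighted deviations of class means from mu.
S_B = [[0.0, 0.0], [0.0, 0.0]]
for count, m in ((n1, mu1), (n2, mu2)):
    d = sub(m, mu)
    S_B = mat_add(S_B, mat_scale(count, outer(d, d)))

M = mat_mul(inverse(S_W), S_B)          # S_W^{-1} S_B

# Candidate eigenpair for the two-class case.
diff = sub(mu1, mu2)
w = mat_vec(inverse(S_W), diff)         # w = S_W^{-1} (mu1 - mu2)
lam = (n1 * n2 / n) * (diff[0] * w[0] + diff[1] * w[1])

# Residual of the eigenvalue equation M w = lam w (should be ~0).
residual = [mw - lam * wi for mw, wi in zip(mat_vec(M, w), w)]
print(w, lam, residual)
```

With two classes there is only one nonzero eigenvalue, so this single direction is the whole projection; for more classes a general eigenvalue routine would be needed.<br />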
<br />
{{Cleanup|date=October 2010|reason=Adding more general comments about the advantages and flaws of FDA would be effective here.}}<br />
<br />
{{Cleanup|date=October 2010|reason=Would you please show how we could reconstruct the original data from the data whose dimensionality has been reduced by FDA.}}<br />
{{Cleanup|date=October 2010|reason= When you reduce the dimensionality of data, in the most general case you lose some features of the data and you cannot reconstruct the data from the reduced space unless the data have special properties that help in reconstruction, such as sparsity. In FDA it seems that we cannot, in general, reconstruct the data from the reduced version of the data }}<br />
<br />
===Generalization of Fisher's Linear Discriminant Analysis ===<br />
<br />
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity<br />
and the fact that it does not require strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be badly degraded by even a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Therefore, there is a need for robust procedures that can accommodate outliers and are not strongly affected by them. A generalization of Fisher's linear discriminant algorithm [http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf] has been developed that leads easily to a very robust procedure.<br />
<br />
Also notice that LDA can be seen as a dimensionality reduction technique. In a general k-class problem, we have k class means, which lie in an affine subspace of dimension at most k-1. Given a data point, we are looking for the closest class mean to this point. In LDA, we project the data point onto that subspace and calculate distances within it. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimensionality from d dimensions to k - 1 dimensions.<br />
<br />
==Linear and Logistic Regression - October 12, 2010==<br />
<br />
===Linear Regression===<br />
Linear regression is an approach for modeling a scalar response <math>\, y</math> as a function of a set of independent (explanatory) variables <math>\,X</math>. In linear regression the goal is to find an appropriate set of explanatory variables for <math>\, y</math> and to estimate its value from them. In classification, by contrast, the goal is to assign data to groups such that members of the same group are more similar to each other than to members of other groups.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
According to Bayes Classification we estimate the posterior as,<br/><br />
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{l}f_{l}(x)\pi_{l}}</math><br/><br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The simple linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
y_i = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
and we can denote it as<br />
:<math><br />
\begin{align}<br />
\mathbf{y} = \beta^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
where <math>\,\beta^{T} = (<br />
\beta_1,..., \beta_{d},\beta_0)</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=<br />
\begin{pmatrix}<br />
\mathbf{x}_{1}, \dots,\mathbf{x}_{n}\\<br />
1, \dots, 1<br />
\end{pmatrix}<br />
</math> is a <math>(d+1) \times n</math> matrix, where <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and responses <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta^{T}</math> such that the linear model fits the data by minimizing the sum of squared errors, using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
We then try to minimize the residual sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\beta^{T}\mathbf{X})(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}(\mathbf{y}^{T}-\mathbf{X}^{T}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}^{T}<br />
\end{align}<br />
</math><br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \hat\beta^{T}\mathbf{X} = <br />
\mathbf{y}\mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X} </math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].<br />
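<br />
As a concrete illustration of the closed-form solution, here is a minimal sketch for a single input variable (<math>d=1</math>), so that <math>\mathbf{X}\mathbf{X}^{T}</math> is a 2x2 matrix that can be inverted by hand. The function name <code>fit_line</code> and the data are assumptions made for this example only.<br />

```python
def fit_line(xs, ys):
    """Least squares for y = beta_1 * x + beta_0.

    Uses beta_hat = (X X^T)^{-1} X y^T, where X has rows (x_1..x_n) and
    (1..1), so X X^T = [[sum x^2, sum x], [sum x, n]].
    """
    n = len(xs)
    sxx = sum(x * x for x in xs)                 # entry (1,1) of X X^T
    sx = sum(xs)                                 # off-diagonal entries
    sxy = sum(x * y for x, y in zip(xs, ys))     # first entry of X y^T
    sy = sum(ys)                                 # second entry of X y^T
    det = sxx * n - sx * sx                      # determinant of X X^T
    beta_1 = (n * sxy - sx * sy) / det
    beta_0 = (sxx * sy - sx * sxy) / det
    return beta_1, beta_0


# Data generated exactly from y = 2x + 1, so the fit should recover (2, 1).
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

For more than one input variable the same formula applies, but the matrix inverse would normally be delegated to a linear algebra library.<br />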
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{l}f_{l}(x)\pi_{l}}</math><br/><br />
To make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If it is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then nothing restricts <math>\displaystyle r(x)</math> to values between 0 and 1. On the other hand, this is a more direct approach to classification, since it does not need to estimate <math>\ f_k(x) </math> and <math>\ \pi_k </math>: for a 0/1 label,<br />
<math>\ 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x)=E(Y|X=x) </math>. <br />
Because the fitted values are not confined to the interval between 0 and 1, this is not a good model of the posterior, but at times it can still lead to a decent classifier. <math>\ y_i=\frac{1}{n_1} </math> <math>\ \frac{-1}{n_2} </math><br />
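<br />
The point about fitted values leaving the interval between 0 and 1 is easy to see on a toy example. The following sketch (made-up data, one input variable) fits a least squares line to 0/1 labels and prints the fitted values.<br />

```python
# Toy data: one input variable, labels coded as 0 and 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least squares for a line: slope = S_xy / S_xx.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted "probabilities" at the training inputs.
fitted = [slope * x + intercept for x in xs]
print(fitted)

# Thresholding at 0.5 still separates the two classes on this data.
predicted = [1 if f > 0.5 else 0 for f in fitted]
print(predicted)
```

Here the smallest fitted value is negative and the largest exceeds 1, yet thresholding at 0.5 classifies all four points correctly, which matches the remark that the model can still give a decent classifier.<br />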
<br />
===Logistic Regression===<br />
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood <math>\displaystyle Pr(Y|X)</math>. Since <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications with two classes, for instance: people survive or die, have a disease or not, have a risk factor or not.<br />
<br />
====Logistic function====<br />
[[File:200px-Logistic-curve.svg.png | Logistic Sigmoid Function]]<br />
<br />
<br />
<br />
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common sigmoid curve. <br />
<br />
1. <math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
3. <math>y(0) = \frac{1}{2}</math><br />
<br />
4. <math> \int y \, dx = \ln(1 + e^{x}) + C</math><br />
<br />
5. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
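<br />
The listed properties can be checked numerically. The sketch below (function names are illustrative) evaluates the logistic function and compares the analytic derivative <math>y(1-y)</math> with a central finite difference.<br />

```python
import math


def logistic(x):
    """Logistic function y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_derivative(x):
    """Analytic derivative dy/dx = y (1 - y)."""
    y = logistic(x)
    return y * (1.0 - y)


h = 1e-6  # step for the central finite difference
for x in (-2.0, 0.0, 1.5):
    numeric = (logistic(x + h) - logistic(x - h)) / (2.0 * h)
    print(x, logistic(x), logistic_derivative(x), numeric)
```

At <math>x = 0</math> the function value is 1/2 and the slope is 1/4, which is the linear regime described above.<br />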
<br />
====Intuition behind Logistic Regression====<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
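<br />
The calculation is short. Writing <math>p=P(Y=1|X=x)</math>, so that <math>P(Y=0|X=x)=1-p</math> (the symbol <math>p</math> is introduced here only for this derivation), the log odds model gives<br />
:<math><br />
\begin{align}<br />
\log\left(\frac{p}{1-p}\right)=\beta^Tx<br />
\;\Rightarrow\; \frac{p}{1-p}=\exp(\beta^Tx)<br />
\;\Rightarrow\; p=\exp(\beta^Tx)(1-p)<br />
\;\Rightarrow\; p=\frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}<br />
\end{align}<br />
</math><br />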
<br />
====The Logistic Regression Model====<br />
<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
{{Cleanup|date=October 18 2010|reason=I could not find any source for these graphs. However, they follow the definition of the defined probability. I don't think the generated graph as it is here is copyrighted, but if you are worried you can draw this figure by applying the function and post the result.}}<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
====Fitting a Logistic Regression====<br />
Logistic regression tries to fit a distribution. The common practice in statistics is to fit a density function, here the posterior probability of each class <math>\displaystyle Pr(Y|X)</math>, to data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence and identical distribution)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n \left[ y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right) \right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[ y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right) \right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[ y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i})) \right]\\<br />
&=\displaystyle\sum_{i=1}^n \left[ y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i})) \right]<br />
\end{align}<br />
</math><br />
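<br />
The simplification above can be sanity-checked numerically. The sketch below evaluates the log-likelihood in its first form and in its simplified final form on made-up data; each input carries a trailing 1 for the intercept, and all names and numbers are assumptions for illustration.<br />

```python
import math


def linear_predictor(beta, x):
    """Inner product beta^T x."""
    return sum(b * xj for b, xj in zip(beta, x))


def log_likelihood_direct(beta, xs, ys):
    """sum_i y_i log p_i + (1 - y_i) log(1 - p_i)."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = linear_predictor(beta, x)
        p1 = math.exp(eta) / (1.0 + math.exp(eta))
        total += y * math.log(p1) + (1 - y) * math.log(1.0 - p1)
    return total


def log_likelihood_simplified(beta, xs, ys):
    """sum_i y_i beta^T x_i - log(1 + exp(beta^T x_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = linear_predictor(beta, x)
        total += y * eta - math.log(1.0 + math.exp(eta))
    return total


# Each x_i is (feature, 1); the constant 1 multiplies the intercept.
xs = [(0.5, 1.0), (1.5, 1.0), (2.0, 1.0), (3.5, 1.0)]
ys = [0, 0, 1, 1]
beta = (0.8, -1.2)

print(log_likelihood_direct(beta, xs, ys))
print(log_likelihood_simplified(beta, xs, ys))
```

Both functions should print the same value up to rounding; the simplified form needs only one logarithm per observation.<br />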
<br />
<br />
{{Cleanup|date=October 13 2010|reason=I think, in the following, y_i * x_i and the single x_i on the right side should both be transposed by matrix calculus?}}<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math> <br />
<br />
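Setting this derivative to zero gives <math>d+1</math> equations that are nonlinear in <math> \underline{\beta} </math>. Because each <math>\underline{x}_i</math> contains a constant component 1 (the intercept term), the corresponding equation reads <math>\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta}) </math>, i.e. the expected number of class ones matches the observed number.<br />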
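<br />
The gradient formula can be verified against finite differences of the log-likelihood. This is a minimal sketch on made-up data (each input has a trailing 1 for the intercept); the function names are illustrative assumptions.<br />

```python
import math


def prob_class1(beta, x):
    """p(x; beta) = exp(beta^T x) / (1 + exp(beta^T x))."""
    eta = sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(beta, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(b * xj for b, xj in zip(beta, x))
        total += y * eta - math.log(1.0 + math.exp(eta))
    return total


def gradient(beta, xs, ys):
    """Analytic gradient: sum_i (y_i - p(x_i; beta)) x_i."""
    grad = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        residual = y - prob_class1(beta, x)
        for j, xj in enumerate(x):
            grad[j] += residual * xj
    return grad


def numerical_gradient(beta, xs, ys, h=1e-6):
    """Central finite differences of the log-likelihood."""
    grad = []
    for j in range(len(beta)):
        up = list(beta)
        down = list(beta)
        up[j] += h
        down[j] -= h
        grad.append((log_likelihood(up, xs, ys)
                     - log_likelihood(down, xs, ys)) / (2.0 * h))
    return grad


xs = [(0.5, 1.0), (1.5, 1.0), (2.0, 1.0), (3.5, 1.0)]
ys = [0, 0, 1, 1]
beta = [0.3, -0.4]

print(gradient(beta, xs, ys))
print(numerical_gradient(beta, xs, ys))
```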
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
====Extension====<br />
<br />
* When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model].<br />
<br />
* Limitations of Logistic Regression:<br />
:1. No assumptions are made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another, because this could cause problems with estimation.<br />
:2. A large number of data points (i.e. a large sample size) is required for logistic regression to provide sufficient numbers in both classes. The more features/dimensions the data have, the larger the sample size required.<br />
<br />
==Lecture summary==<br />
{{Cleanup|date=October 18 2010|reason=Can anybody provide a better lecture summary? The one below is to just get it started}}<br />
In this lecture, linear regression was introduced and the density function for the two-class problem was defined. Maximum likelihood was used to estimate the distribution parameters (i.e. fitting the density function to the logistic class).<br />
<br />
== Logistic Regression Cont. - October 14, 2010 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Estimating Parameters <math>\underline{\beta}</math> ===<br />
<br />
'''Criterion''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
'''Newton-Raphson Algorithm:'''<br /><br />
<br />
Suppose we want to find <math>\ x^* </math> such that <math>\ f(x^*)=0</math>.<br />
<br />
We first pick a starting point <math>\ x^{old}</math> and then iterate:<br />
<br /><br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f(x^{old})}{f'(x^{old})} </math> <br /><br />
<math> \ x^{old} \leftarrow x^{new}</math> <br />
<br /><br />
This is repeated until convergence, and the final iterate is taken as <math>\ x^* </math>.<br />
<br />
If we want to maximize or minimize <math>\ f(x) </math>, we instead solve <math>\ f'(x)=0 </math>, which gives the update<br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f'(x^{old})}{f''(x^{old})} </math><br />
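<br />
A minimal scalar sketch of the iteration is given below. The function name, tolerance and iteration cap are assumptions for illustration; the same routine handles optimization by passing the first and second derivatives in place of the function and its derivative.<br />

```python
def newton_raphson(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Find x with f(x) = 0 by the update x_new = x_old - f(x_old) / f'(x_old)."""
    x_old = x0
    for _ in range(max_iter):
        x_new = x_old - f(x_old) / f_prime(x_old)
        if abs(x_new - x_old) < tol:   # converged: successive iterates agree
            return x_new
        x_old = x_new
    return x_old


# Root finding: solve x^2 - 2 = 0 starting from x = 1.
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)

# Optimization: minimize g(x) = (x - 3)^2 + 1 by solving g'(x) = 0,
# i.e. pass g' as the function and g'' as its derivative.
minimizer = newton_raphson(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, 0.0)

print(root, minimizer)
```

For a quadratic objective the optimization variant reaches the minimizer in a single step, because the second-order model is exact.<br />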
<br />
<br /><br />
<br />
In vector notation the above can be written as <br /><br />
<br />
<math><br />
X^{new} \leftarrow X^{old} - H^{-1}\Delta<br />
</math><br />
<br /><br />
<math>H</math> is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix], i.e. the matrix of second derivatives, and <math>\Delta</math> is the gradient, both evaluated at <math>X^{old}</math>.<br />
<br /><br />
<br />
'''Note:''' If the Hessian is not invertible, the [http://en.wikipedia.org/wiki/Generalized_inverse generalized inverse] or pseudo-inverse can be used.<br />
<br /><br />
<br /><br />
<br />
<br />
As shown above, the [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second derivative, or Hessian.<br />
<br />
<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website that includes a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative is obtained if we first reduce the number of occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math><br />
<br />
and then evaluate <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{(d+1)}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i,i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^{T}\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
<math>\ Z </math> is called the adjusted response. The problem is solved repeatedly because <math>\ P </math>, <math>\ W </math>, and <math>\ Z </math> change whenever <math>\underline{\beta}</math> is updated. This algorithm is called [http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares iteratively reweighted least squares] because it solves a weighted least squares problem at each iteration.<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-\underline{\beta}^T X)(\underline{y}-\underline{\beta}^TX)^T</math><br />
<br />
whose solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}^{T}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg \min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
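<br />
The steps above can be sketched in Python for the simplest case of one input variable plus an intercept, so that <math>XWX^T</math> is a 2x2 matrix solved by hand. The function name, tolerance, iteration cap and data are illustrative assumptions; the data are deliberately not linearly separable, so that the maximum likelihood estimate is finite.<br />

```python
import math


def irls_logistic(xs, ys, tol=1e-10, max_iter=100):
    """Fit P(Y=1|x) = exp(b1*x + b0) / (1 + exp(b1*x + b0)) by IRLS.

    xs: scalar inputs, ys: 0/1 labels. Returns [b1, b0].
    """
    beta = [0.0, 0.0]                                   # step 1
    for _ in range(max_iter):
        # Step 3: linear predictor and class-1 probabilities.
        eta = [beta[0] * x + beta[1] for x in xs]
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # Step 4: diagonal of W.
        w = [pi * (1.0 - pi) for pi in p]
        # Step 5: adjusted response Z = X^T beta + W^{-1} (Y - P).
        z = [e + (y - pi) / wi for e, y, pi, wi in zip(eta, ys, p, w)]
        # Step 6: solve (X W X^T) beta = X W Z for the 2x2 case.
        a11 = sum(wi * x * x for wi, x in zip(w, xs))
        a12 = sum(wi * x for wi, x in zip(w, xs))
        a22 = sum(w)
        b1 = sum(wi * x * zi for wi, x, zi in zip(w, xs, z))
        b2 = sum(wi * zi for wi, zi in zip(w, z))
        det = a11 * a22 - a12 * a12
        new_beta = [(a22 * b1 - a12 * b2) / det,
                    (a11 * b2 - a12 * b1) / det]
        # Step 7: stop when the update is negligible.
        if max(abs(n - o) for n, o in zip(new_beta, beta)) < tol:
            return new_beta
        beta = new_beta
    return beta


# Made-up, non-separable data: the labels overlap around x = 2..3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
beta_hat = irls_logistic(xs, ys)
print(beta_hat)
```

At convergence the gradient <math>X(\underline{Y}-\underline{P})</math> is numerically zero, which is a convenient check on any implementation. On perfectly separable data the weights <math>w_{i,i}</math> approach zero and this plain version would break down.<br />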
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''Note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow \exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1. A closed-form solution exists for least squares.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1. No closed-form solution exists.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d+1</math> (<math>\,d</math> coefficients plus an intercept). The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional,the number of adjustable parameter in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t dimension.<br />
#LDA estimate parameters more efficiently by using more information about data and samples without class labels can be also used in LDA. <br />
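<br />
The two parameter counts can be compared directly. The following is a minimal sketch; the function names are illustrative and do not come from the lecture, and the LDA count follows the breakdown given in the list (two mean vectors, one shared covariance matrix, two priors).<br />
<br />
```python
def logistic_param_count(d, include_intercept=False):
    """Two-class logistic regression: d slopes, plus one intercept if counted."""
    return d + 1 if include_intercept else d


def lda_param_count(d):
    """Two-class LDA as counted in the text: two mean vectors (2d), one shared
    symmetric covariance matrix (d(d+1)/2) and two class priors (2)."""
    return 2 * d + d * (d + 1) // 2 + 2


# Linear versus quadratic growth in the dimension d.
for d in (2, 10, 100):
    print(d, logistic_param_count(d), lda_param_count(d))
```
<br />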
<br />
{{Cleanup|date=October 2010|reason= Could somebody please validate the following points}} <br />
{{Cleanup|date=October 2010|reason= I'm not too sure about the first point either, but it seems reasonable to me. Would be great if someone can confirm this point. Thanks}} <br />
<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust. (For high-dimensional data, logistic regression is more accommodating.)<br />
#In practice, logistic regression and LDA often give similar results.<br />
#Logistic regression is more robust because it does not assume that the independent variables are normally distributed.<br />
<br />
Many other advantages of logistic regression are explained [http://www.statgun.com/tutorials/logistic-regression.html here].<br />
<br />
<br />
====By example====<br />
<br />
We now compare LDA and logistic regression on an example, again using the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression model that classifies the data. This function returns B, a <math>\,(d+1)</math><math>\,\times</math><math>\,(k-1)</math> matrix of estimates in which each column holds the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{\exp(0.1861-5.5917X_1-3.0547X_2)}{1+\exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
:<math>P(Y=2 | X=x)=\frac{1}{1+\exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
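<br />
:As a cross-check outside MATLAB, the posterior and the classification rule can be evaluated from the reported coefficients. This is only a sketch in Python; the function names are illustrative, and the coefficients are simply copied from the output of B above.<br />
<br />
```python
import math

# Coefficients reported by mnrfit in the example: intercept, then X1, X2.
B = (0.1861, -5.5917, -3.0547)


def posterior_class1(x1, x2, b=B):
    """P(Y=1 | X=x) under the fitted two-class logistic model."""
    a = b[0] + b[1] * x1 + b[2] * x2
    # exp(a) / (1 + exp(a)) rewritten as 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + math.exp(-a))


def classify(x1, x2, b=B):
    """Return 1 if the linear score is non-negative, otherwise 2."""
    a = b[0] + b[1] * x1 + b[2] * x2
    return 1 if a >= 0 else 2
```
<br />
:A point on the decision boundary has a linear score of zero, so its posterior for class 1 is 0.5.<br />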
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is the decision boundary found by logistic regression. The line shows how the two classes are split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
===Lecture Summary===<br />
<br />
Traditionally, logistic regression parameters are estimated by maximum likelihood. However, other optimization techniques may be used as well.<br />
<br /><br />
Since there is no closed-form solution for the zero of the first derivative of the log-likelihood, the Newton-Raphson algorithm is used. Since the problem is convex, any optimum that Newton's method converges to is a global optimum.<br />
<br /><br />
Logistic regression requires fewer parameters than LDA or QDA and is therefore more favorable for high-dimensional data.<br />
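<br />
The Newton-Raphson iteration can be sketched for the smallest interesting case, one feature plus an intercept, where the 2x2 system can be inverted by hand. This is an illustrative sketch rather than the lecture's implementation; the function name, the fixed iteration count and the 0/1 label coding are assumptions.<br />
<br />
```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def newton_logistic_1d(xs, ys, iterations=25):
    """Fit P(Y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson on the log-likelihood.

    xs: list of scalar features; ys: list of 0/1 labels.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient g of the log-likelihood and the entries of X^T W X
        # (the negative Hessian), with W = diag(p_i * (1 - p_i)).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            break  # singular system; stop rather than divide by ~0
        # Newton step: beta <- beta + (X^T W X)^{-1} g, via the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1
```
<br />
The sketch assumes the two classes overlap; for perfectly separable data the maximum-likelihood estimate does not exist and the coefficients grow without bound.<br />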
<br />
===Supplements===<br />
<br />
A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture.<br />
<br />
== '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' ==<br />
<br />
=== Lecture Summary ===<br />
<br />
In this lecture, the topic of logistic regression was concluded by covering multi-class logistic regression, and a new topic, the perceptron, was introduced. The perceptron is a linear classifier for two-class problems. Its main goal is to classify data into two classes by minimizing the distances between the misclassified points and the decision boundary. This will be continued in the following lectures.<br />
<br />
=== Multi-Class Logistic Regression ===<br />
Recall that in two-class logistic regression, the posterior probability of one of the classes (say class 1) is modeled by a function of the form shown in Figure 1. <br />
<br />
The posterior probability of the other class (class 0) is the complement of that of class 1. <br /><br /><br />
<math>\displaystyle P(Y=0 | X=x) = 1 - P(Y=1 | X=x)</math><br /><br />
<br />
This function is called the sigmoid (logistic) function, which is why the algorithm is called "logistic regression".<br />
[[File:Picture1.png|150px|thumb|right|Fig.1: <math>P(Y=1 | X=x)</math>]]<br />
<br />
<math>\displaystyle \sigma\,\!(a) = \frac {e^a}{1+e^a} = \frac {1}{1+e^{-a}}</math><br /><br /><br />
<br />
In two-class logistic regression, we compare the posterior of one class to the other one using this ratio:<br /><br />
<br />
:<math> \frac{P(Y=1|X=x)}{P(Y=0|X=x)}</math><br /><br />
<br />
If we look at the natural logarithm of this ratio, we find that it is always a linear function in <math>x</math>:<br /><br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\underline{\beta}^T\underline{x} \quad \rightarrow (*)</math> <br /><br /><br />
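<br />
The linearity of the log-odds can be checked numerically. The sketch below computes the posteriors with the logistic function and then takes the logarithm of their ratio; the coefficient and feature values are hypothetical.<br />
<br />
```python
import math


def sigmoid(a):
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def log_odds(beta, x):
    """log(P(Y=1|x) / P(Y=0|x)), computed from the posteriors themselves."""
    a = sum(b * xi for b, xi in zip(beta, x))
    p1 = sigmoid(a)
    return math.log(p1 / (1.0 - p1))


beta = [0.5, -1.0, 2.0]   # hypothetical coefficients; first entry is the intercept
x = [1.0, 0.3, 0.8]       # the leading 1 carries the intercept
# The two printed values agree up to rounding: the log-odds equal beta^T x.
print(log_odds(beta, x), sum(b * xi for b, xi in zip(beta, x)))
```
<br />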
<br />
What if we have more than two classes?<br /><br />
<br />
Using (*), we can extend the notion of logistic regression for the cases where we have more than two classes.<br /><br />
<br />
Assume we have <math>k</math> classes. Looking at the logarithm of the ratio of posteriors of each class and the k<sup>th</sup> class, we have: <br /><br />
<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_1}^T\underline{x} </math> <br /><br />
:<math>\log\left(\frac{P(Y=2|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_2}^T\underline{x} </math> <br /><br />
::::<math> \vdots</math><br /><br />
:<math>\log\left(\frac{P(Y=k-1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_{k-1}}^T\underline{x} </math> <br /><br />
<br />
<br />
Although in the above posterior ratios, the denominator is chosen to be the posterior of the last class (class k), the choice of denominator is arbitrary in that the posterior estimates are equivariant under this choice - [http://www.springerlink.com/content/t45k620382733r71/ Linear Methods for Classification].<br /><br /><br />
<br />
Each of these functions is linear in <math>x</math>; however, each has a different <math>\underline{\,\beta_{i}}</math>. We have to make sure that the posterior probabilities assigned to the different classes sum to one.<br /><br /><br />
<br />
In general, we can write:<br />
<br /><math>P(Y=c | X=x) = \frac{e^{\underline{\beta_c}^T \underline{x}}}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}},\quad c \in \{1,\dots,k-1\} </math><br /><br />
<br /><math>P(Y=k | X=x) = \frac{1}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}}</math><br /><br />
These posteriors clearly sum to one. <br /><br /><br />
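<br />
A minimal sketch of these two formulas, with class k as the reference class, is given below. The function name and the list-based representation of the coefficient vectors are illustrative choices.<br />
<br />
```python
import math


def multiclass_posteriors(betas, x):
    """Posteriors for k classes given k-1 coefficient vectors (class k is the reference).

    betas: list of k-1 coefficient lists; x: feature list (include a leading 1
    if an intercept is wanted). Returns a list of k probabilities.
    """
    scores = [math.exp(sum(b * xi for b, xi in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    # Classes 1..k-1 first, then the reference class k.
    return [s / denom for s in scores] + [1.0 / denom]
```
<br />
For any input, the returned values sum to one, and the logarithm of the ratio of class c to class k recovers the linear function for class c.<br />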
<br />
In the two-class case, it is fairly simple to find the parameter <math>\beta</math> (in two-class logistic regression, <math>\beta</math> has dimension <math>(d+1)\times1</math>), as mentioned in previous lectures. In the multi-class case the iterative Newton method can still be used, but here <math>\beta</math> is of size <math>(d+1)\times(k-1)</math> and the weight matrix W is dense rather than diagonal. The resulting algorithm is computationally inefficient, although still feasible. A trick is to re-parametrize the logistic regression problem by expanding the input vector <math>x</math> (Question 4 in Assignment 2).<br />
<br /><br /><br />
<br />
Note that logistic regression does not assume a distribution for the prior, whereas LDA assumes the prior to be Bernoulli. <br /><br /><br />
<br />
===Neural Network Concept===<br />
The concept of constructing an artificial neural network came from scientists who wanted to simulate the human neural network on computers, with the aim of creating programs that can learn as people do. A neural network is a method in artificial intelligence that provides a simplified model of neural processing in the brain, even though the relation between this model and the biological architecture of the brain is not yet clear.<br />
<br />
=== Perceptron ===<br />
<br />
The [http://en.wikipedia.org/wiki/Perceptron Perceptron] was invented in 1957 by [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt]. It is the basic building block of feedforward neural networks.<br /><br /><br />
<br />
We know that the least-squares coefficients obtained by regressing a -1/1 response variable <math>\displaystyle Y</math> on the observations <math>\displaystyle x</math> lead to the same coefficients as LDA (recall that LDA minimizes the distance between the discriminant function (decision boundary) and the data points). Least squares returns the sign of the linear combination of features as the class label (Figure 2). This concept was called the perceptron in the engineering literature of the 1950s. <br /><br /><br />
<br />
[[File:Perceptron.jpg|371px|thumb|right| Fig.2 Diagram of a linear perceptron ]]<br />
<br />
There is a cost function <math>\displaystyle D</math> that perceptron tries to minimize:<br /><br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math><br /><br />
<br />
where <math>\displaystyle M</math> is the set of misclassified points. <br /><br />
<br />
This is basically minimizing the sum of distances between the misclassified points and the decision boundary.<br /><br /><br />
<br />
'''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br /><br />
<br />
Consider points <math>\underline{x_1}</math>, <math>\underline{x_2}</math> and a decision boundary defined by <math>\underline{\beta}^T\underline{x}+\beta_0=0</math>, as shown in Figure 3.<br /><br />
<br />
[[File:DB.jpg|248px|thumb|right| Fig.3 Distance from the decision boundary ]]<br />
<br />
Since both <math>\underline{x_1}</math> and <math>\underline{x_2}</math> lie on the decision boundary, we have:<br /><br />
<math>\underline{\beta}^T\underline{x_1}+\beta_0=0 \rightarrow (1)</math><br /><br />
<math>\underline{\beta}^T\underline{x_2}+\beta_0=0 \rightarrow (2)</math><br /><br />
<br />
From (1) and (2):<br /><br />
<math>\underline{\beta}^T(\underline{x_2}-\underline{x_1})=0</math><br /><br />
<br />
Therefore, <math>\displaystyle \underline{\beta}</math> is orthogonal to <math>\underline{x_2}-\underline{x_1}</math>, which lies along the decision boundary; that is, <math>\displaystyle \underline{\beta}</math> is orthogonal to the decision boundary. <br /><br />
<br />
Then the signed distance of a point <math>\underline{x_0}</math> from the decision boundary is, up to the scale factor <math>1/\|\underline{\beta}\|</math>, the projection of <math>\underline{x_0}-\underline{x_2}</math> onto <math>\underline{\beta}</math>: <br /><br />
<br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})</math><br /><br />
<br />
From (2): <br /><br />
<br />
<math>\underline{\beta}^T\underline{x_2}= -\beta_0</math>. <br /><br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})=\underline{\beta}^T\underline{x_0}-\underline{\beta}^T\underline{x_2}=\underline{\beta}^T\underline{x_0}+\beta_0</math><br /><br />
<br />
Therefore, the signed distance from any point <math>\underline{x_{i}}</math> to the discriminant hyperplane is proportional to <math>\underline{\beta}^T\underline{x_{i}}+\beta_0</math> (dividing by <math>\|\underline{\beta}\|</math> gives the Euclidean distance).<br /><br /><br />
<br />
However, this quantity is not always positive. Consider <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>: if <math>\underline{x_{i}}</math> is classified ''correctly'', then this product is positive, since <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are either both positive or both negative. If <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> is negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> therefore makes the cost function non-negative. <br /><br /><br />
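<br />
The cost function can be evaluated directly from its definition. In this sketch the labels are coded as -1/+1, and a point lying exactly on the boundary is counted as misclassified; that tie-breaking rule is an assumption, not something stated in the lecture.<br />
<br />
```python
def perceptron_cost(beta, beta0, xs, ys):
    """D(beta, beta0) = -sum over misclassified i of y_i * (beta^T x_i + beta0).

    xs: list of feature lists; ys: labels in {-1, +1}.
    A point counts as misclassified when y_i * (beta^T x_i + beta0) <= 0.
    """
    cost = 0.0
    for x, y in zip(xs, ys):
        margin = y * (sum(b * xi for b, xi in zip(beta, x)) + beta0)
        if margin <= 0:
            cost -= margin  # margin is non-positive, so the cost grows
    return cost
```
<br />
Correctly classified points contribute nothing, so the cost is zero exactly when no point is misclassified.<br />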
<br />
==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 ==<br />
===Lecture Summary===<br />
In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on gradient descent. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network (FFNN). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation. <br />
<br />
To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.<br />
<br />
===Perceptron===<br />
The last lecture introduced the Perceptron and showed how it can suggest a solution for the 2-class classification problem. We saw that the solution requires minimization of a cost function, which is basically a summation of the distances of the misclassified data points to the separating hyperplane. This cost function is<br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x}_i+\beta_0),</math><br />
<br />
in which <math>\,M</math> is the set of misclassified points. Thus, the objective is to find <math>\arg\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>.<br />
<br />
====Perceptron Learning Algorithm====<br />
To minimize <math>D(\underline{\beta},\beta_0)</math>, an algorithm that uses gradient descent has been suggested. Gradient descent, also known as steepest descent, is a numerical optimization technique that starts from an initial value for <math>(\underline{\beta},\beta_0)</math> and iteratively approaches an optimal solution. Each iteration updates <math>(\underline{\beta},\beta_0)</math> by subtracting from it a multiple of the gradient of <math>D(\underline{\beta},\beta_0)</math>. Mathematically, this gradient is<br />
<br />
<math>\nabla D(\underline{\beta},\beta_0)<br />
= \left( \begin{array}{c}\cfrac{\partial D}{\partial \underline{\beta}} \\ \\ <br />
\cfrac{\partial D}{\partial \beta_0} \end{array} \right)<br />
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}\underline{x}_i \\ <br />
-\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math><br />
<br />
However, the perceptron learning algorithm does not use the sum of the contributions from each observation to calculate the gradient for each step. Instead, each step uses the gradient contribution from only a single observation, and each successive step uses a different observation. This slight modification is called stochastic gradient descent. That is, instead of subtracting some multiple of <math>\nabla D(\underline{\beta},\beta_0)</math> at each step, we subtract a multiple of the contribution of a single misclassified observation. Since that contribution carries a minus sign, this amounts to adding a multiple of<br />
<br />
<math>\left( \begin{array}{c} y_{i}\underline{x}_i \\ <br />
y_{i} \end{array} \right)</math><br />
<br />
As a result, the pseudo code for the Perceptron Learning Algorithm is as follows:<br />
<br />
:1) Choose a random initial value for <math>(\underline{\beta},\beta_0)</math>.<br />
<br />
:2) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^0\\<br />
\beta_0^0<br />
\end{pmatrix}</math><br />
<br />
:3) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{new}}\\<br />
\beta_0^{\mathrm{new}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix}<br />
y_i \underline{x_i}\\<br />
y_i<br />
\end{pmatrix}</math> for some <math>\,i \in M</math>.<br />
<br />
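:4) If the termination criterion has not been met, set <math>(\underline{\beta}^{\mathrm{old}},\beta_0^{\mathrm{old}}) \leftarrow (\underline{\beta}^{\mathrm{new}},\beta_0^{\mathrm{new}})</math>, go back to step 3, and use a different observation (i.e. a different <math>\,i</math>).<br />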
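<br />
The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name, the random choice of the misclassified point, the iteration cap and the rule that boundary points count as misclassified are all assumptions.<br />
<br />
```python
import random


def train_perceptron(xs, ys, rho=1.0, max_iter=1000, seed=0):
    """Perceptron learning by stochastic gradient descent.

    xs: list of feature lists; ys: labels in {-1, +1}.
    Returns (beta, beta0, converged).
    """
    rng = random.Random(seed)
    d = len(xs[0])
    beta = [rng.uniform(-1, 1) for _ in range(d)]   # steps 1-2: random start
    beta0 = rng.uniform(-1, 1)
    for _ in range(max_iter):
        # M: indices of points with y_i * (beta^T x_i + beta0) <= 0.
        M = [i for i, (x, y) in enumerate(zip(xs, ys))
             if y * (sum(b * xi for b, xi in zip(beta, x)) + beta0) <= 0]
        if not M:
            return beta, beta0, True     # no misclassified points remain
        i = rng.choice(M)                # step 3: pick one misclassified point
        beta = [b + rho * ys[i] * xi for b, xi in zip(beta, xs[i])]
        beta0 += rho * ys[i]
    return beta, beta0, False            # iteration cap reached
```
<br />
The third return value distinguishes termination because <math>\,M</math> is empty from termination because the iteration cap was reached, which matters when the classes are not linearly separable.<br />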
<br />
The learning rate <math>\,\rho</math> controls the size of each step toward <math>\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>. A larger value of <math>\,\rho</math> makes the steps larger. If <math>\,\rho</math> is set too large, however, the minimum could be missed (over-stepped).<br />
In practice, <math>\rho</math> can be adaptive rather than fixed: it can be larger in the first steps than in the last ones. At the beginning, a larger <math>\rho</math> helps to find an approximate answer sooner, and a smaller <math>\rho</math> in the last steps helps to tune the final answer more accurately. <br />
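<br />
One simple way to make the learning rate shrink over time is sketched below. The particular schedule and its parameter names are hypothetical; the lecture does not prescribe a specific form.<br />
<br />
```python
def decaying_rate(step, rho0=1.0, decay=0.01):
    """Hypothetical schedule: rho0 at step 0, shrinking as 1 / (1 + decay * step)."""
    return rho0 / (1.0 + decay * step)


# Larger steps early on, smaller steps later.
for step in (0, 100, 1000):
    print(step, decaying_rate(step))
```
<br />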
<br />
<br />
As mentioned earlier, the learning algorithm uses just one of the data points at each iteration; this is the common practice when dealing with online applications. In an online application, datapoints are accessed one-at-a-time because training data is not available in batch form. The learning algorithm does not require the derivative of the cost function with respect to the previously seen points; instead, we just have to take into consideration the effect of each new point.<br />
<br />
One way that the algorithm could terminate is if there are no more misclassified points (i.e. if the set <math>\,M</math> is empty). As long as there are points in <math>\,M</math>, the algorithm continues until some other termination criterion is reached. The termination criterion for an optimization algorithm is usually convergence, but for numerical methods this is not well-defined. In theory, convergence is reached when the gradient of the cost function is zero; in numerical methods, a gradient within some margin of error of zero is accepted instead.<br />
<br />
If the data are linearly separable, the algorithm is theoretically guaranteed to converge in a finite number of iterations. This number of iterations depends on the <br />
<br />
* learning rate <math>\,\rho</math><br />
<br />
* initial value <math>(\underline{\beta},\beta_0)</math><br />
<br />
* difficulty of the problem. The problem is more difficult if the gap between the classes of data is very small.<br />
<br />
Note that we consider the offset term <math>\beta_0</math> separately from the <math>\underline{\beta}</math> to distinguish this formulation from those in which the direction of the hyperplane (<math>\underline{\beta}</math>) has been considered.<br />
<br />
A major concern about gradient descent is that it may get trapped in local optimal solutions.<br />
<br />
====Some notes on the Perceptron Learning Algorithm====<br />
<br />
* If the training data points are available in batch form, it is better to take advantage of a closed-form optimization technique such as least squares or maximum-likelihood estimation for linear classifiers. (These closed-form solutions had been around for many years before the invention of the Perceptron.)<br />
<br />
* Just like the linear classifier, a Perceptron can discriminate between only two classes at a time, and one can extend it to multi-class problems by using one of the <math>k-1</math>, <math>k</math>, or <math>k(k-1)/2</math>-hyperplane methods.<br />
<br />
* If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane that makes the training error zero. The convergence is guaranteed if the learning rate is set adequately.<br />
<br />
* If the two classes are not linearly separable, the algorithm will never converge. So, one needs a termination criterion in these cases (e.g. a maximum number of iterations within which convergence is expected, or the rate of change of both the cost function and its derivative).<br />
<br />
* In the case of linearly separable classes, the final solution and number of iterations will be dependent on the initial conditions, learning rate, and the gap between the two classes. In general, a smaller gap between classes requires a greater number of iterations for the algorithm to converge.<br />
<br />
* The learning rate (or update step) has a direct impact on both the number of iterations and the accuracy of the solution to the optimization problem. Smaller values make convergence slower, but we end up with a more accurate solution. Conversely, larger values make the process faster, but we may lose some precision. One should therefore strike a balance in this trade-off in order to reach an accurate enough solution fast enough (exploration vs. exploitation).<br />
<br />
In the upcoming lectures, we introduce Support Vector Machines (SVM), which use an iterative optimization scheme similar to that of the Perceptron but define the cost function differently.<br />
<br />
===Universal Function Approximator===<br />
The universal function approximator is a mathematical formulation for a group of estimation techniques. The usual formulation for it is<br />
<br />
<math>\hat{Y}(x)=\sum\limits_{i=1}^{n}\alpha_i\sigma(\omega_i^Tx+b_i),</math><br />
<br />
where <math>\hat{Y}(x)</math> is an estimate of a function <math>\,Y(x)</math>. According to the universal approximation theorem, for any <math>\,\epsilon>0</math> and a sufficiently large <math>\,n</math> we have<br />
<br />
<math>|\hat{Y}(x) - Y(x)|<\epsilon,</math><br />
<br />
which means that <math>\hat{Y}(x)</math> can get as close to <math>\,Y(x)</math> as necessary.<br />
<br />
This formulation assumes that the output, <math>\,Y(x)</math>, is a linear combination of a set of functions <math>\,\sigma(.)</math>, where <math>\,\sigma(.)</math> is a nonlinear function of the inputs <math>\,x_i</math>.<br />
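<br />
The formula can be evaluated with a few lines of code. The sketch below uses the logistic function for <math>\,\sigma</math> and, as a hypothetical example, combines two steep sigmoids into a bump that is close to 1 on the interval [1, 2] and close to 0 elsewhere; the parameter values are chosen for illustration only.<br />
<br />
```python
import math


def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))


def approximator(x, alphas, omegas, bs):
    """Y_hat(x) = sum_i alpha_i * sigma(omega_i^T x + b_i)."""
    return sum(alpha * sigma(sum(w * xi for w, xi in zip(omega, x)) + b)
               for alpha, omega, b in zip(alphas, omegas, bs))


# sigma(10(x - 1)) - sigma(10(x - 2)): a smooth bump over [1, 2].
alphas = [1.0, -1.0]
omegas = [[10.0], [10.0]]
bs = [-10.0, -20.0]
for x in (0.0, 1.5, 3.0):
    print(x, approximator([x], alphas, omegas, bs))
```
<br />
Adding more terms, with different offsets and weights, allows more complicated shapes to be built up from such pieces.<br />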
<br />
====Generalization Factors====<br />
Even though this formulation represents a universal function approximator, which means that it can be fitted to a set of data as closely as demanded, the closeness of fit must be carefully decided upon. In many cases, the purpose of the model is to target unseen data. However, the fit to this unseen data is impossible to determine before it arrives.<br />
<br />
To overcome this dilemma, a common practice is to divide the available data points into two sets: training data and validation data. We use the training data to estimate the fixed parameters of the model, and then use the validation data to find values for the construction-dependent parameters. What these construction-dependent parameters are depends on the model. In the case of a polynomial, the construction-dependent parameter would be its highest degree, and for a neural network, the construction-dependent parameters could be the number of hidden layers and the number of neurons in each layer.<br />
<br />
These matters of model generalization vs. complexity will be discussed in more detail in the lectures to follow.<br />
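<br />
The procedure can be sketched generically. In the code below, <code>fit</code> is a hypothetical user-supplied function that trains a model of a given complexity on the training set and returns a predictor; the function names, the split fraction and the use of mean squared error are all assumptions made for illustration.<br />
<br />
```python
import random


def split(data, train_fraction=0.7, seed=0):
    """Shuffle (x, y) pairs and split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def select_complexity(data, candidates, fit, train_fraction=0.7, seed=0):
    """Return the candidate whose fitted model has the lowest validation MSE.

    fit(train, c) must return a function mapping x to a prediction.
    """
    train, valid = split(data, train_fraction, seed)

    def mse(model):
        return sum((model(x) - y) ** 2 for x, y in valid) / len(valid)

    return min(candidates, key=lambda c: mse(fit(train, c)))
```
<br />
For a polynomial model, the candidates would be the degrees to try; for a neural network, they could be the numbers of hidden units.<br />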
<br />
===Feed-Forward Neural Network===<br />
The Neural Network (NN) is one application of the universal function approximator. It can be thought of as a system of Perceptrons linked together as units of a network. One particular NN useful for classification is the Feed-Forward Neural Network (FFNN), which consists of multiple "hidden layers" of Perceptron units. Our discussion here is based around the FFNN, which has the topology shown in Figure 1. The units of the first hidden layer receive input from the original features. Between the hidden layers, connections from each unit are always directed to units in the next adjacent layer. In the output layer, which receives input only from the last hidden layer, each unit produces a target measurement for a distinct class (i.e. <math>\,K</math> classes require <math>\,K</math> units). In Figure 1, the units in a single layer are distributed vertically, and the inputs and outputs of the network are shown as the far left and right layers respectively.<br />
<br />
[[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]]<br />
<br />
====Mathematical Model of the FFNN with One Hidden Layer====<br />
The FFNN with one hidden layer for a <math>\,K</math>-class problem is defined as follows. Let <math>\,d</math> be the number of input features, <math>\,p</math> be the number of units in the hidden layer, and <math>\,K</math> be the number of classes (i.e. the number of units in the output layer).<br />
<br />
Each neural unit calculates its derived feature (i.e. output) using a linear combination of its inputs. Suppose <math>\,\underline{x}</math> is the <math>\,d</math>-dimensional vector of input features. Then, each neural unit uses a <math>\,d</math>-dimensional vector of weights to combine these input features: for the <math>\,i</math>th neural unit, let <math>\underline{u}_i</math> be this vector of weights. The linear combination calculated by the <math>\,i</math>th unit is then given by<br />
<br />
<math>a_i = \underline{u}_i^T\underline{x}</math><br />
<br />
However, we want the derived feature to lie between 0 and 1, so we apply an ''activation function'' <math>\,\sigma(a)</math>. The derived feature for the <math>\,i</math>th unit is then given by<br />
<br />
<math>\,z_i = \sigma(a_i)</math> where <math>\,\sigma</math> is typically the logistic function<br />
<br />
<math>\sigma(a) = \cfrac{1}{1+e^{-a}}</math><br />
<br />
Now, we place each of the derived features <math>\,z_i</math> from the hidden layer into a <math>\,p</math>-dimensional vector:<br />
<br />
<math>\underline{z} = \left[ \begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_p \end{array}\right]</math><br />
<br />
Like in the hidden layer, each unit in the output layer calculates its derived feature using a linear combination of its inputs. Each neural unit uses a <math>\,p</math>-dimensional vector of weights to combine the input features derived from the hidden layer. Let <math>\,\underline{w}_k</math> be this vector of weights used in the <math>\,k</math>th unit. The linear combination calculated by the <math>\,k</math>th unit is then given by<br />
<br />
<math>\hat{y}_k = \underline{w}_k^T\underline{z}</math><br />
<br />
<math>\,\hat{y}_k</math> is thus the target measurement for the <math>\,k</math>th class. Note that an activation function <math>\,\sigma</math> is not used here.<br />
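<br />
The forward computation of this one-hidden-layer model can be sketched as follows. The function name and the list-of-lists representation of the weight vectors are illustrative choices, and no bias terms are included because the formulas above do not use them.<br />
<br />
```python
import math


def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, U, W):
    """One-hidden-layer FFNN forward pass.

    x: d input features; U: p weight vectors u_i (each of length d);
    W: K weight vectors w_k (each of length p).
    Returns (z, y_hat): hidden derived features and the K output values.
    """
    a = [sum(u_j * x_j for u_j, x_j in zip(u, x)) for u in U]      # a_i = u_i^T x
    z = [sigma(a_i) for a_i in a]                                  # z_i = sigma(a_i)
    y_hat = [sum(w_j * z_j for w_j, z_j in zip(w, z)) for w in W]  # no activation
    return z, y_hat
```
<br />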
<br />
Notice that in each of the hidden units, two operations take place:<br />
<br />
* a linear combination of the neuron's inputs is calculated using corresponding weights<br />
<br />
* a nonlinear operation on the linear combination is performed. <br />
<br />
These two calculations are shown in Figure 2. <br />
<br />
The nonlinear function <math>\,\sigma(.)</math> is called the activation function. Activation functions, like the logistic function shown earlier, are usually continuous and have a limited range. Another common activation function used in neural networks is <math>\,\tanh(x)</math> (Figure 3).<br />
<br />
[[File:neuron2.png|300px|thumb|right|Fig.2 A general construction for a single neuron]]<br />
[[File:actfcn.png|300px|thumb|right|Fig.3 <math>\tanh</math> as activation function]]<br />
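<br />
The two activation functions are closely related: <math>\,\tanh(a) = 2\sigma(2a) - 1</math>, so the hyperbolic tangent is a rescaled and shifted logistic function with range (-1, 1) instead of (0, 1). A short numerical check, written as an illustrative sketch:<br />
<br />
```python
import math


def sigma(a):
    """Logistic function, with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


for a in (-3.0, 0.0, 3.0):
    # The last two columns agree up to rounding: tanh(a) = 2 * sigma(2a) - 1.
    print(a, sigma(a), math.tanh(a), 2.0 * sigma(2.0 * a) - 1.0)
```
<br />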
<br />
The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression, and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, a threshold stage is necessary.<br />
<br />
====Mathematical Model of the FFNN with Multiple Hidden Layers====<br />
In the FFNN model with a single hidden layer, the derived features were represented as elements of the vector <math>\underline{z}</math>, and the original features were represented as elements of the vector <math>\underline{x}</math>. In the FFNN model with more than one hidden layer, <math>\underline{z}</math> is processed by the second hidden layer in the same way that <math>\underline{x}</math> was processed by the first hidden layer. Perceptrons in the second layer each use their own combination of weights to calculate a new set of derived features. These new derived features are processed by the third hidden layer in a similar way, and the cycle repeats for each additional hidden layer. This progression of processing is depicted in Figure 4.<br />
<br />
====Back-Propagation Learning Algorithm====<br />
<br />
[[File:bpl.png|300px|thumb|right|Fig.4 Labels for weights and derived features in the FFNN.]]<br />
<br />
Every linear-combination calculation in the FFNN involves weights that need to be set, and these weights are set using training data and an algorithm called Back-Propagation. This algorithm is similar to the gradient-descent algorithm introduced in the discussion of the Perceptron. The primary difference is that the gradient used in Back-Propagation is calculated in a more complicated way.<br />
<br />
First of all, we want to minimize the error between the estimated and true target measurements for the training data. That is, if <math>\,U</math> is the set of all weights in the FFNN, then we want to determine<br />
<br />
<math>\arg\min_U \left|y - \hat{y}\right|^2</math><br />
<br />
Now, suppose the hidden layers of the FFNN are labelled as in Figure 4. Then, we want to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the hidden layers of the FFNN. For weights <math>\,u_{jl}</math> this means we will need to find<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}}<br />
= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot<br />
\cfrac{\partial a_j}{\partial u_{jl}} = \delta_{j}z_l<br />
</math><br />
<br />
However, the closed-form solution for <math>\,\delta_{j}</math> is unknown, so we develop a recursive definition (<math>\,\delta_{j}</math> in terms of <math>\,\delta_{i}</math>):<br />
<br />
<math><br />
\delta_j = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j} <br />
= \sum_{i=1}^p \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot<br />
\cfrac{\partial a_i}{\partial a_j} <br />
= \sum_{i=1}^p \delta_i\cdot u_{ij} \cdot \sigma'(a_j)<br />
= \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}<br />
</math><br />
<br />
We also need to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the ''output layer'' <math>\,k</math> of the FFNN (this layer is not shown in Figure 4, but it would be the next layer to the right of the rightmost layer shown). For weights <math>\,u_{ki}</math> this means<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}}<br />
= \cfrac{\partial \left|y - \sum_i u_{ki}z_i\right|^2}{\partial u_{ki}}<br />
= -2(y - \sum_i u_{ki}z_i)z_i<br />
= -2(y - \hat{y})z_i<br />
</math><br />
<br />
By analogy with our computation of <math>\,\delta_j</math>, we define<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}</math><br />
<br />
However, <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}}<br />
= -2(y - \hat{y})</math><br />
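<br />
The output-layer result above can be checked numerically. The following is a minimal Python sketch, not part of the original notes: it assumes a single linear output unit with <math>\,\hat{y} = \sum_i u_{ki}z_i</math>, and the function names <code>output_gradient</code> and <code>numerical_gradient</code> are hypothetical. It compares the analytic gradient <math>\,-2(y - \hat{y})z_i</math> with a central finite-difference estimate.<br />
<br />
```python
def output_gradient(u, z, y):
    """Analytic gradient of (y - y_hat)^2 w.r.t. the output weights u,
    where y_hat = sum_i u[i] * z[i] (no activation in the output layer)."""
    y_hat = sum(ui * zi for ui, zi in zip(u, z))
    delta_k = -2.0 * (y - y_hat)          # delta_k from the derivation
    return [delta_k * zi for zi in z]     # delta_k * z_i


def numerical_gradient(u, z, y, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    def loss(w):
        return (y - sum(wi * zi for wi, zi in zip(w, z))) ** 2

    grads = []
    for i in range(len(u)):
        up = list(u)
        dn = list(u)
        up[i] += eps
        dn[i] -= eps
        grads.append((loss(up) - loss(dn)) / (2.0 * eps))
    return grads
```
<br />
If the two functions agree to within rounding error for arbitrary inputs, the analytic expression is very likely correct; the same kind of check can be applied to the hidden-layer gradients.<br />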
<br />
Now that we have <math>\,\delta_k</math> and a recursive definition for <math>\,\delta_j</math>, it is clear that our weights can be deduced by starting from the output layer and working back through the hidden layers toward the input layer.<br />
<br />
Based on the above derivation, our algorithm for determining weights in the FFNN is as follows<br />
<br />
:1) Choose random initial weights.<br />
<br />
:2) Apply a new datapoint <math>\underline{x}</math> to the FFNN as the input layer, and calculate the values for all units.<br />
<br />
:3) Compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math>.<br />
<br />
:4) Back-propagate layer-by-layer by computing <math>\delta_j = \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}</math> for all units.<br />
<br />
:5) Compute <math>\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} = \delta_{j}z_l</math> for all weights <math>\,u_{jl}</math>.<br />
<br />
:6) Update <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}}<br />
- \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} </math> for all weights <math>\,u_{jl}</math> where <math>\,\rho</math> is the learning rate.<br />
<br />
:7) If the termination criterion has not been met, go back to step 2 and apply another datapoint; once every datapoint in the training set has been applied, a new "epoch" begins.<br />
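<br />
The seven steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the course's reference implementation: it assumes a single hidden layer, the logistic sigmoid as <math>\,\sigma</math>, one linear output unit, a bias weight stored at index 0 of each weight vector, and a fixed number of epochs as the termination criterion. The names <code>train_ffnn</code> and <code>predict</code> are hypothetical.<br />
<br />
```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def predict(U_hidden, u_out, x):
    """Forward pass: sigmoid hidden units, linear output unit."""
    z_in = [1.0] + list(x)
    z_hid = [1.0] + [sigmoid(sum(u * z for u, z in zip(row, z_in)))
                     for row in U_hidden]
    return sum(u * z for u, z in zip(u_out, z_hid))


def train_ffnn(data, n_hidden=3, rho=0.05, epochs=2000, seed=0):
    """Back-propagation for a one-hidden-layer FFNN.

    data: list of (x, y) pairs, x a list of floats, y a float.
    Returns (U_hidden, u_out): hidden-layer and output-layer weights.
    """
    rng = random.Random(seed)
    d = len(data[0][0])
    # Step 1: random initial weights (index 0 is the bias weight).
    U_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(d + 1)]
                for _ in range(n_hidden)]
    u_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    points = list(data)
    for _ in range(epochs):            # Step 7: fixed number of epochs
        rng.shuffle(points)            # new permutation in each epoch
        for x, y in points:
            # Step 2: apply the datapoint and compute all unit values.
            z_in = [1.0] + list(x)
            z_hid = [1.0] + [sigmoid(sum(u * z for u, z in zip(row, z_in)))
                             for row in U_hidden]
            y_hat = sum(u * z for u, z in zip(u_out, z_hid))
            # Step 3: delta of the output unit.
            delta_k = -2.0 * (y - y_hat)
            # Step 4: hidden deltas, sigma'(a_j) * delta_k * u_kj,
            # using sigma'(a_j) = z_j * (1 - z_j) for the logistic sigmoid.
            delta = [z_hid[j + 1] * (1.0 - z_hid[j + 1]) * delta_k * u_out[j + 1]
                     for j in range(n_hidden)]
            # Steps 5 and 6: gradient = delta * z, then gradient-descent update.
            u_out = [u - rho * delta_k * z for u, z in zip(u_out, z_hid)]
            U_hidden = [[u - rho * delta[j] * z for u, z in zip(row, z_in)]
                        for j, row in enumerate(U_hidden)]
    return U_hidden, u_out
```
<br />
For example, calling <code>train_ffnn</code> on a handful of <code>(x, y)</code> pairs and then <code>predict</code> on a new input gives the network's estimate. Note that the hidden deltas are computed before the output weights are updated, so that step 4 uses the same weights as the forward pass.<br />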
<br />
====Alternative Description of the Back-Propagation Algorithm====<br />
Label the inputs and outputs of the <math>\,i</math>th hidden layer <math>\underline{x}_i</math> and <math>\underline{y}_i</math> respectively, and let <math>\,\sigma(.)</math> be the activation function for all neurons. We now have<br />
<br />
<math>\begin{align}<br />
\begin{cases}<br />
\underline{y}_1=\sigma(W_1.\underline{x}_1),\\<br />
\underline{y}_2=\sigma(W_2.\underline{x}_2),\\<br />
\underline{y}_3=\sigma(W_3.\underline{x}_3),<br />
\end{cases}<br />
\end{align}</math><br />
<br />
where <math>\,W_i</math> is the matrix of connection weights between layers <math>\,i</math> and <math>\,i+1</math>; it has <math>\,n_i</math> columns and <math>\,n_{i+1}</math> rows, where <math>\,n_i</math> is the number of neurons of the <math>\,i^{th}</math> layer.<br />
<br />
Considering these matrix equations, one can write a closed form for the derivative of the error with respect to the weights of the network. For a neural network with two hidden layers, the equations are as follows.<br />
<br />
<math>\begin{align}<br />
\frac{\partial E}{\partial W_3}=&diag(e).\sigma'(W_3.\underline{x}_3).(\underline{x}_3)^T,\\<br />
\frac{\partial E}{\partial W_2}=&\sigma'(W_2.\underline{x}_2).(\underline{x}_2)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3\}\},\\<br />
\frac{\partial E}{\partial W_1}=&\sigma'(W_1.\underline{x}_1).(\underline{x}_1)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3.diag(\sigma'(W_2.\underline{x}_2)).W_2\}\},<br />
\end{align}</math><br />
<br />
where <math>\,\sigma'(.)</math> is the derivative of the activation function <math>\,\sigma(.)</math>.<br />
<br />
Using this closed-form derivative, it is possible to code the procedure for any number of layers and neurons. Here is Matlab code for the back-propagation algorithm. (<math>\,\tanh</math> is used as the activation function.)<br />
<br />
<br />
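 % Assumed to be defined beforehand (not shown in the original listing):<br />
 %   data (dd-by-Q, one sample per column; rows 1..f are features, rows f+1..dd are targets),<br />
 %   n (layer sizes, length l+1), w (weights, including bias), ro (learning rate), ep (number of epochs),<br />
 %   E = zeros(1,ep), i = 0, and a helper shuffle(data,2) that permutes the columns of data.<br />
 while i < ep<br />
     i = i + 1;<br />
     data = shuffle(data,2);                      % new permutation of the samples in each epoch<br />
     for j = 1:Q<br />
         io = zeros(max(n)+1,length(n));          % column k holds the outputs of layer k (row 1 is the bias unit)<br />
         gp = io;                                 % derivatives of the activation function<br />
         io(1:n(1)+1,1) = [1;data(1:f,j)];<br />
         % forward pass<br />
         for k = 1:l<br />
             io(2:n(k+1)+1,k+1) = w(2:n(k+1)+1,1:n(k)+1,k)*io(1:n(k)+1,k);<br />
             gp(1:n(k+1)+1,k) = [0;1./(cosh(io(2:n(k+1)+1,k+1))).^2];   % tanh'(a) = 1/cosh(a)^2<br />
             io(1:n(k+1)+1,k+1) = [1;tanh(io(2:n(k+1)+1,k+1))];<br />
             wg(1:n(k+1)+1,1:n(k)+1,k) = diag(gp(1:n(k+1)+1,k))*w(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
         % output error (the leading 0 corresponds to the bias row)<br />
         e = [0;io(2:n(l+1)+1,l+1) - data(f+1:dd,j)];<br />
         wg(1:n(l+1)+1,1:n(l)+1,l) = diag(e)*wg(1:n(l+1)+1,1:n(l)+1,l);<br />
         gp(1:n(l+1)+1,l) = diag(e)*gp(1:n(l+1)+1,l);<br />
         d = eye(n(l+1)+1);<br />
         E(i) = E(i) + 0.5*norm(e)^2;             % accumulate the squared error of epoch i<br />
         % backward pass: update the weights from the output layer toward the input layer<br />
         for k = l:-1:1<br />
             w(1:n(k+1)+1,1:n(k)+1,k) = w(1:n(k+1)+1,1:n(k)+1,k) - ro*diag(sum(d,1))*gp(1:n(k+1)+1,k)*(io(1:n(k)+1,k)');<br />
             d = d*wg(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
     end<br />
 end<br />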
<br />
====Some notes on the neural network and its learning algorithm====<br />
<br />
* The activation functions are usually approximately linear around the origin. If this is the case, choosing random weights between <math>\,-0.5</math> and <math>\,0.5</math> and normalizing the data may speed up the algorithm in the very first steps of the procedure, as the linear combination of the inputs and weights falls within the linear region of the activation function.<br />
<br />
* Learning of the neural network using the back-propagation algorithm takes place in epochs. An epoch is a single pass through the entire training set.<br />
<br />
* It is a common practice to randomly change the permutation of the training data in each one of the epochs, to make the learning independent of the data permutation.<br />
<br />
* Given a set of data for training a neural network, one should set aside a portion of it as the validation dataset, in order to choose a suitable number of layers and number of neurons in each layer. The best construction may be the one which leads to the least error on the validation dataset. Validation data must not be used for training the network.<br />
<br />
* We can also use the validation-training scheme to estimate how many epochs are enough for training the network.<br />
<br />
* It is also common to use other optimization algorithms, such as steepest descent and conjugate gradient, in batch form.<br />
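<br />
Several of the practical notes above can be sketched in Python. This is a hedged illustration rather than a prescribed recipe: the helper names <code>standardize</code>, <code>init_weights</code>, <code>split_train_validation</code> and <code>best_epoch</code> are hypothetical, and zero-mean, unit-variance scaling is only one possible way of normalizing the data.<br />
<br />
```python
import random


def standardize(X):
    """Scale each feature (column) of X to zero mean and unit variance.
    Constant features are left centred at zero."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / n) ** 0.5 or 1.0
            for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in X]


def init_weights(n_out, n_in, rng):
    """Random initial weights between -0.5 and 0.5."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def split_train_validation(data, validation_fraction, rng):
    """Set aside a portion of the data for validation; it is never trained on."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def best_epoch(validation_errors):
    """1-based index of the epoch with the smallest validation error."""
    best = min(range(len(validation_errors)), key=validation_errors.__getitem__)
    return best + 1
```
<br />
In use, one would record the validation error after every epoch of training and pass the resulting list to <code>best_epoch</code> to estimate how many epochs are enough.<br />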
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem, where it is difficult to estimate the errors. So in practice, configuring a Neural Network with Back-propagation faces some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. The Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
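<br />
The vanishing of <math>\,\delta</math> can be illustrated with a small Python sketch. It rests on simplifying assumptions that are not in the original notes: every layer has a single unit, the activation is the logistic sigmoid (whose derivative is at most 0.25), and all weights and pre-activations are the same. The function name <code>delta_magnitudes</code> is hypothetical.<br />
<br />
```python
import math


def sigmoid_prime(a):
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)


def delta_magnitudes(n_layers, weight=1.0, a=0.0, delta_out=1.0):
    """Back-propagate a delta through a chain of single-unit layers using
    delta_j = sigma'(a_j) * u_ij * delta_i, and return all the deltas,
    starting with the output delta."""
    deltas = [delta_out]
    for _ in range(n_layers):
        deltas.append(sigmoid_prime(a) * weight * deltas[-1])
    return deltas
```
<br />
With the default arguments each layer multiplies the delta by 0.25, so after ten layers it has shrunk by a factor of <math>\,0.25^{10}</math>, which is less than <math>\,10^{-6}</math>. The weight updates of the early layers are then correspondingly tiny.<br />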
<br />
The approach to training the deep network is to first assume the network has only two layers and train these two layers. After that we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize the energy function, which is inspired by the theory in atomic physics concerning the most stable condition; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.<br />
<br />
==== Difficulties of training deep architecture <ref>{{Cite journal | title = Exploring Strategies for Training Deep Neural Networks | url = http://jmlr.csail.mit.edu/papers/volume10/larochelle09a/larochelle09a.pdf | year = 2009 | journal = Journal of Machine Learning Research | page = 1-40 | volume = 10 | last1 = Larochelle | first1 = H. | last2 = Bengio | first2 = Y. | last3 = Louradour | first3 = J. | last4 = Lamblin | first4 = P. }}</ref> ====<br />
<br />
Given a particular task, a natural way to train a deep network is to frame it as an optimization<br />
problem by specifying a supervised cost function on the output layer with respect to the desired<br />
target and use a gradient-based optimization algorithm in order to adjust the weights and biases<br />
of the network so that its output has low cost on samples in the training set. Unfortunately, deep<br />
networks trained in that manner have generally been found to perform worse than neural networks<br />
with one or two hidden layers.<br />
<br />
We discuss two hypotheses that may explain this difficulty. The first one is that gradient descent<br />
can easily get stuck in poor local minima (Auer et al., 1996) or plateaus of the non-convex training<br />
criterion. The number and quality of these local minima and plateaus (Fukumizu and Amari, 2000)<br />
clearly also influence the chances for random initialization to be in the basin of attraction (via<br />
gradient descent) of a poor solution. It may be that with more layers, the number or the width<br />
of such poor basins increases. To reduce the difficulty, it has been suggested to train a neural<br />
network in a constructive manner in order to divide the hard optimization problem into several<br />
greedy but simpler ones, either by adding one neuron (e.g., see Fahlman and Lebiere, 1990) or one<br />
layer (e.g., see Lengellé and Denoeux, 1996) at a time. These two approaches have demonstrated to<br />
be very effective for learning particularly complex functions, such as a very non-linear classification<br />
problem in 2 dimensions. However, these are exceptionally hard problems, and for learning tasks<br />
usually found in practice, this approach commonly overfits.<br />
<br />
This observation leads to a second hypothesis. For high capacity and highly flexible deep networks,<br />
there actually exists many basins of attraction in its parameter space (i.e., yielding different<br />
solutions with gradient descent) that can give low training error but that can have very different generalization<br />
errors. So even when gradient descent is able to find a (possibly local) good minimum<br />
in terms of training error, there are no guarantees that the associated parameter configuration will<br />
provide good generalization. Of course, model selection (e.g., by cross-validation) will partly correct<br />
this issue, but if the number of good generalization configurations is very small in comparison<br />
to good training configurations, as seems to be the case in practice, then it is likely that the training<br />
procedure will not find any of them. But, as we will see in this paper, it appears that the type of<br />
unsupervised initialization discussed here can help to select basins of attraction (for the supervised<br />
fine-tuning optimization phase) from which learning good solutions is easier both from the point of<br />
view of the training set and of a test set.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". But now we know that they are just logistic regression layers stacked on top of each other and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. Because of claims of this kind, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since the objective is not convex, we still face the problem of local minima, although people have devised other techniques to mitigate this problem.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. They are not an active research area in machine learning. Having said that, NNs still have wide applications in engineering fields such as control.<br />
<br />
===Business Applications of Neural Networks===<br />
<br />
Neural networks are increasingly being used in real-world business applications and, in some cases, such as fraud detection, they have already become the method of choice. Their use for risk assessment is also growing and they have been employed to visualize complex databases for marketing segmentation. This method covers a wide range of business interests — from finance management, through forecasting, to production. The combination of statistical, neural and fuzzy methods now enables direct quantitative studies to be carried out without the need for rocket-science expertise.<br />
<br />
* On the Use of Neural Networks for Analysis Travel Preference Data <br />
* Extracting Rules Concerning Market Segmentation from Artificial Neural Networks <br />
* Characterization and Segmenting the Business-to-Consumer E-Commerce Market Using Neural Networks<br />
* A Neurofuzzy Model for Predicting Business Bankruptcy <br />
* Neural Networks for Analysis of Financial Statements <br />
* Developments in Accurate Consumer Risk Assessment Technology <br />
* Strategies for Exploiting Neural Networks in Retail Finance <br />
* Novel Techniques for Profiling and Fraud Detection in Mobile Telecommunications<br />
* Detecting Payment Card Fraud with Neural Networks<br />
* Money Laundering Detection with a Neural-Network <br />
* Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=7367stat841f102010-10-25T21:26:08Z<p>Hclam: /* Multi-Class Logistic Regression & Perceptron - October 19, 2010 */</p>
<hr />
<div>==[[Proposal Fall 2010]] ==<br />
==[[statf10841Scribe|Editor sign up]] ==<br />
{{Cleanup|date=October 8 2010|reason=Provide a summary for each topic here.}}<br />
== Summary ==<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
=== Principal Component Analysis ===<br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA takes a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data and projects the data onto these directions so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation of our data is usually much easier to visualize, and it also exhibits the most informative aspects (dimensions) of our data whilst capturing as much of the variation exhibited by our data as possible.<br />
<br />
==[[f10_Stat841_digest |Digest ]] ==<br />
<br />
== ''' Reference Textbook''' ==<br />
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
== ''' Classification - September 21, 2010''' ==<br />
<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction] (feature extraction or manifold learning). Note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.<br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found [http://www.e-knowledge.ca/quotes.php?topic=Knowledge here].<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers <br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which can take a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
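<br />
The definition of the empirical error rate translates directly into code. The following Python sketch is only an illustration: the function name <code>empirical_error_rate</code> is hypothetical, and the classification rule <math>\,h</math> is assumed to be given as an ordinary function.<br />
<br />
```python
def empirical_error_rate(h, X, Y):
    """Fraction of training inputs that the rule h misclassifies:
    (1/n) * sum over i of I(h(X_i) != Y_i)."""
    n = len(X)
    return sum(1 for x_i, y_i in zip(X, Y) if h(x_i) != y_i) / n
```
<br />
For instance, a threshold rule on a single feature that misclassifies one of four training inputs has an empirical error rate of one quarter.<br />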
<br />
<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify any new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
<br />
In practice, the empirical error rate is used to estimate the true error rate, whose value cannot be known exactly because the parameter values of the underlying process cannot be known and can only be estimated using available data. The empirical error rate, in practice, estimates the true error rate quite well in that, as mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], it is an unbiased estimator of the true error rate.<br />
<br />
=== Bayes Classifier ===<br />
<br />
{{Cleanup|date=October 14 2010|reason=In response to the previous tag: The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function <math>\,f_y(x)</math> in the equation:<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
The simpler form of the likelihood function seen in the naive Bayes is:<br />
:<math><br />
\begin{align}<br />
f_y(x) = P(X=x|Y=y) = {\prod_{j=1}^{d} P(X_{j}=x_{j}|Y=y)}<br />
\end{align}<br />
</math><br />
The Bayes classifier taught in class was not the naive Bayes classifier. Perhaps a comment should be made about the naive Bayes classifier in the body of the text}}<br />
<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests][2].<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class <math> y \in \mathcal{Y} </math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h^*</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification for any given new data input, i.e., for any generic classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*(x)) \le L(h(x))</math>. Here, <math>\,L</math> represents the true error rate, <math>\,h^*</math> is the Bayes classifier's classification rule, and <math>\,x</math> is any given data input's feature values. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. In addition, estimating all of these probability functions (the likelihood, the prior probability, and the evidence) is computationally very expensive, which also makes some other classifiers more favorable than the Bayes classifier.<br />
<br />
A detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h^*</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h^*)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
'''Theorem'''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course. ''Note: these are the known y values in the training data.'' <br />
<br />
These known data are summarized in the following tables:<br />
<br />
:[[File:裁剪.jpg]]<br />
{{Cleanup|date=September 2010|reason=this graph is not complete, the reason is that it should be in consistent with the computation below.}}<br />
<br />
For each student, his/her feature values are <math>\, x = (G, M, H) </math> and his/her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = \{0, 1, 0\}</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
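<br />
The calculation above can be reproduced with a few lines of code. The following is a minimal sketch, not part of the original notes: the function names <code>posterior_class1</code> and <code>bayes_rule</code> are hypothetical, and the likelihoods 0.05 and 0.2 and the priors 0.5 are simply the values used in the example.<br />

```python
def posterior_class1(lik1, lik0, prior1, prior0):
    """Return r(x) = P(Y=1 | X=x) for two classes via the Bayes formula."""
    # Evidence P(X=x): sum of likelihood * prior over both classes.
    evidence = lik1 * prior1 + lik0 * prior0
    return (lik1 * prior1) / evidence


def bayes_rule(r_hat):
    """Classification rule h*(x): 1 if r_hat > 1/2, otherwise 0."""
    return 1 if r_hat > 0.5 else 0


# Values taken from the example: P(X=(0,1,0)|Y=1) = 0.05,
# P(X=(0,1,0)|Y=0) = 0.2, and both priors equal to 0.5.
r_hat = posterior_class1(lik1=0.05, lik0=0.2, prior1=0.5, prior0=0.5)
label = bayes_rule(r_hat)
print(r_hat, label)  # r_hat is approximately 0.2, so the label is 0 (fail)
```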
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
<br />
The Bayesian view of probability states that any event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how believable the occurrence of event E is before anything is known about any other event whose occurrence could affect that of event E. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work that laid the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the hyperplane the input lies in. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by assuming that both classes have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distributions] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the number of features. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> to <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of <math>\,D(h^*)</math> is as follows:<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out alike terms and factoring).<br />
<br />
It is easy to see that, under LDA, the Bayes classifier's decision boundary <math>\,D(h^*)</math> has the form <math>\,a^\top x+b=0</math> and is linear in <math>\,x</math>. This is where the word ''linear'' in linear discriminant analysis comes from.<br />
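<br />
To make the linear form concrete, the last line of the derivation can be read as <math>\,a^\top x + b = 0</math> with <math>\,a = \Sigma^{-1}(\mu_1-\mu_0)</math> and <math>\,b = \log(\frac{\pi_1}{\pi_0}) - \frac{1}{2}(\mu_1^\top\Sigma^{-1}\mu_1 - \mu_0^\top\Sigma^{-1}\mu_0)</math>. The sketch below computes these coefficients with NumPy; the function name <code>lda_boundary</code> and the numerical values of the means, covariance matrix and priors are made up for illustration only.<br />

```python
import numpy as np


def lda_boundary(mu0, mu1, sigma, pi0, pi1):
    """Return (a, b) such that the LDA decision boundary is a @ x + b = 0."""
    sigma_inv = np.linalg.inv(sigma)
    a = sigma_inv @ (mu1 - mu0)
    b = np.log(pi1 / pi0) - 0.5 * (mu1 @ sigma_inv @ mu1 - mu0 @ sigma_inv @ mu0)
    return a, b


# Hypothetical parameters, chosen only to illustrate the formula.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
a, b = lda_boundary(mu0, mu1, sigma, pi0=0.5, pi1=0.5)

# With equal priors, the midpoint between the two means lies on the boundary.
midpoint = (mu0 + mu1) / 2
residual = a @ midpoint + b
print(a, b, residual)
```

A point <math>\,x</math> is assigned to class 1 when <math>\,a^\top x + b > 0</math> and to class 0 otherwise.<br />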
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1}<br />
\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0</math>. In addition, for any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> have equal prior probabilities (for instance, because both have the same number of data points), then, in this special case, the resulting decision boundary passes through the point exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
According to [http://www.lsv.uni-saarland.de/Vorlesung/Digital_Signal_Processing/Summer06/dsp06_chap9.pdf this link], some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that the data in each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
<br />
The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-classes case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows: <br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math> (by cancellation)<br />
:<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)</math> (by taking the log of both sides)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0</math> (by expanding out)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0</math> <br />
<br />
It is easy to see that, under QDA, the decision boundary <math>\,D(h^*)</math> has the form <math>\,x^\top A x+b^\top x+c=0</math> and is quadratic in <math>\,x</math>. This is where the word ''quadratic'' in quadratic discriminant analysis comes from.<br />
<br />
As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0</math>.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier's classification rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math> <br />
where, <br />
* In the case of LDA, which assumes that a common covariance matrix is shared by all classes, <math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is linear in <math>\,x</math>.<br />
<br />
* In the case of QDA, which assumes that each class has its own covariance matrix, <math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is quadratic in <math>\,x</math>.<br />
<br />
<br />
'''Note:''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]<br />
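<br />
The two discriminant functions in the theorem translate directly into code. The following is a minimal sketch under stated assumptions: the function names <code>lda_delta</code>, <code>qda_delta</code> and <code>classify</code> are hypothetical, the example parameters are made up, and classes are indexed from 0 in the code rather than from 1 as in the theorem.<br />

```python
import numpy as np


def lda_delta(x, mu, sigma, prior):
    """LDA discriminant: x' S^-1 mu - 0.5 mu' S^-1 mu + log(prior)."""
    sigma_inv = np.linalg.inv(sigma)
    return x @ sigma_inv @ mu - 0.5 * mu @ sigma_inv @ mu + np.log(prior)


def qda_delta(x, mu, sigma, prior):
    """QDA discriminant: -0.5 log|S_k| - 0.5 (x-mu)' S_k^-1 (x-mu) + log(prior)."""
    diff = x - mu
    _, log_det = np.linalg.slogdet(sigma)  # returns (sign, log|sigma|)
    return -0.5 * log_det - 0.5 * diff @ np.linalg.inv(sigma) @ diff + np.log(prior)


def classify(x, mus, sigmas, priors, delta):
    """Return the (0-based) index of the class with the largest discriminant."""
    scores = [delta(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))


# Hypothetical two-class problem with identity covariance matrices.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
x = np.array([2.5, 2.0])

lda_label = classify(x, mus, sigmas, priors, lda_delta)
qda_label = classify(x, mus, sigmas, priors, qda_delta)
print(lda_label, qda_label)  # both rules pick the class centered at (3, 3)
```

When all classes share the same covariance matrix, as in this example, the two rules agree, because the QDA discriminant then differs from the LDA discriminant only by terms that do not depend on the class.<br />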
<br />
===In practice===<br />
In practice, the true values of the priors <math>\,\pi_k</math>, the means <math>\,\mu_k</math>, and the covariance matrices <math>\,\Sigma_k</math> are unknown, so we use their maximum likelihood estimates from the sample in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
Common covariance, denoted <math>\Sigma</math>, is defined as the weighted average of the covariance for each class. <br />
<br />
In the case where we need a common covariance matrix, we get the estimate using the following equation:<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{n-k} </math><br />
<br />
where <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
<br />
See the details about the [http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices estimation of covariance matrices].<br />
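<br />
The estimates above can be computed from a labeled sample as sketched below. This is only an illustration: the function name <code>estimate_parameters</code> and the toy data are assumptions, the per-class covariances use the divisor <math>\,n_k</math>, and the common covariance uses the divisor <math>\,n-k</math> exactly as in the formulas above.<br />

```python
import numpy as np


def estimate_parameters(X, y):
    """Estimate priors, class means, per-class covariances and the
    common (pooled) covariance from data X (n x d) with labels y."""
    n, d = X.shape
    classes = np.unique(y)
    priors, means, covs = {}, {}, {}
    scatter = np.zeros((d, d))
    for k in classes:
        k = int(k)
        Xk = X[y == k]                            # rows belonging to class k
        n_k = Xk.shape[0]
        priors[k] = n_k / n                       # pi_k = n_k / n
        means[k] = Xk.mean(axis=0)                # mu_k
        centered = Xk - means[k]
        covs[k] = centered.T @ centered / n_k     # Sigma_k with divisor n_k
        scatter += n_k * covs[k]
    pooled = scatter / (n - len(classes))         # common covariance
    return priors, means, covs, pooled


# Toy data: two classes of four points each (hypothetical values).
X = np.array([[0., 0.], [2., 0.], [0., 2.], [2., 2.],
              [5., 5.], [7., 5.], [5., 7.], [7., 7.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
priors, means, covs, pooled = estimate_parameters(X, y)
print(priors, means, pooled)
```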
<br />
===Computation For QDA And LDA===<br />
<br />
First, let us consider QDA, and examine each of the following two cases.<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
<math>\, \Sigma_k = I </math> for every class <math>\,k</math> implies that our data is spherical. This means that the data of each class <math>\,k</math> is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{-1}{2}\log(|I|)</math>, is zero since <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class with the smallest adjusted distance maximizes <math>\,\delta_k</math>, and, according to the theorem, the point is assigned to that class. In particular, when all of the priors are equal, the point is simply assigned to the class with the nearest center.<br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top </math> (In general, when a matrix <math>\,A</math> has the singular value decomposition <math>\,A=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,AA^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,A^\top A</math>. <br />
So if <math>\, A</math> is symmetric and positive semi-definite, we can take <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric and positive semi-definite, because it is a covariance matrix), and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top </math> (since <math>\,U_k</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^\top(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S_k^{-\frac{1}{2}}U_k^\top x </math> and <math>\, S_k^{-\frac{1}{2}}U_k^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>.<br />
<br />
A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math> where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math> and <math>\,\mu_k^*</math>, treating them as in Case 1 above.<br />
<br />
{{Cleanup|date=October 18 2010|reason=The sentence above may cause some misleading. In general case, <math>\,\Sigma_k </math> may not be the same . So you can't treat them completely the same as in Case 1 above. You need to compute <math>\, log{|\Sigma_k |} </math> differently. Here is a detailed discussion below:}}<br />
{{Cleanup|date=October 18 2010|reason=The sentence above is right since by transforming<math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>, the new variable variance is <math>I</math>}}<br />
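<br />
The identity derived above, namely that <math>\,(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)</math> equals the squared Euclidean distance between the transformed point and the transformed center, can be checked numerically. The snippet below is a small sketch with a made-up covariance matrix, mean, and data point; the variable names are hypothetical.<br />

```python
import numpy as np

# Hypothetical covariance matrix, class center and data point.
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

# SVD of the symmetric positive definite covariance: sigma = U diag(s) U^T.
U, s, _ = np.linalg.svd(sigma)
W = np.diag(s ** -0.5) @ U.T          # the transformation S^{-1/2} U^T

x_star = W @ x
mu_star = W @ mu

# Quadratic form in the original space ...
quad_form = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)
# ... and squared Euclidean distance in the transformed space.
euclid_sq = np.sum((x_star - mu_star) ** 2)
print(quad_form, euclid_sq)           # the two values agree up to rounding
```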
<br />
<br />
Note that, because the classes generally have different covariance matrices, the term <math>\, \log{|\Sigma_k|}</math> must also be computed separately for each class before <math> \,\delta_k </math> can be evaluated for QDA.<br />
<br />
Note that a single transformation can be applied to all of the data only when the classes share the same covariance matrix <math>\,\Sigma</math>, in other words, when all of the classes have the same shape. In that case, one transformation whitens every class at once, which is the setting of LDA.<br />
<br />
If the classes have different shapes, in other words, different covariance matrices <math>\,\Sigma_k</math>, can a single transformation still be used to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Consider two classes with different shapes and a data point whose class is to be determined. Which transformation should be used? Applying only the transformation of class A and then comparing plain Euclidean distances would implicitly assume that the data point belongs to class A, so no single transformation can serve both classes.<br />
<br />
{{Cleanup|date=October 18 2010|reason=The statement above may not be true, because in assignment 1, we did do the QDA computation using this approach although the corresponding three covarience matrices are different, the reason why the answer is Yes is as below }}<br />
<br />
The approach can still be used for QDA, however, provided that each class uses its own transformation. Consider two classes with different shapes and a data point whose class is to be determined. Apply the transformation of each of the two classes to the data point in turn, compute <math>\,\delta_1</math> and <math>\,\delta_2</math> (including the corresponding <math>\,\log|\Sigma_k|</math> terms), and then determine which class the data point belongs to by comparing <math> \,\delta_1 </math> and <math> \,\delta_2 </math>.<br />
<br />
In summary, to apply QDA on a data set <math>\,X</math>, in the general case where <math>\, \Sigma_k \ne I </math> for each class <math>\,k</math>, one can proceed as follows:<br />
<br />
:: Step 1: For each class <math>\,k</math>, apply singular value decomposition to the covariance matrix <math>\,\Sigma_k</math> estimated from the data <math>\,X_k</math> of that class to obtain <math>\,S_k</math> and <math>\,U_k</math>, where <math>\,\Sigma_k = U_k S_k U_k^\top</math>.<br />
<br />
:: Step 2: For each class <math>\,k</math>, define the transformation <math>\,x^* = S_k^{-\frac{1}{2}}U_k^\top x</math> of that class, and transform its center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math> and each class <math>\,k</math>, transform <math>\,x</math> with the transformation of class <math>\,k</math> and find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu_k^*</math>. Assign <math>\,x</math> to the class for which this squared Euclidean distance is the least over all of the classes.<br />
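<br />
The steps above can be sketched in Matlab as follows. This is a minimal illustration on synthetic two-class data, not the code used in class; the names <code>Xc</code>, <code>mu</code>, <code>W</code> and <code>dist2</code> are assumptions introduced here, and the decomposition is applied to each class's sample covariance matrix. The rule uses only the squared distance, as in Step 3. The full QDA discriminant also contains the terms <math>\,-\frac{1}{2}\log|\Sigma_k| + \log\pi_k</math>, which matter when the classes have different covariance determinants or priors.<br />

```matlab
% Synthetic data: two classes in d = 2 dimensions, one column per point.
rng(0);                                     % fix the seed for repeatability
Xc = cell(1, 2);
Xc{1} = [2 0; 0 0.5] * randn(2, 200);                           % class 1
Xc{2} = [0.5 0; 0 2] * randn(2, 200) + repmat([4; 4], 1, 200);  % class 2

K  = numel(Xc);
mu = cell(1, K);
W  = cell(1, K);                            % W{k} = S_k^(-1/2) * U_k'
for k = 1:K
    mu{k} = mean(Xc{k}, 2);                 % class centre (d x 1)
    [U, S, ~] = svd(cov(Xc{k}'));           % Step 1: Sigma_k = U * S * U'
    W{k} = diag(1 ./ sqrt(diag(S))) * U';   % Step 2: transformation of class k
end

% Step 3: classify a new point x by the smallest squared distance.
x = [3.5; 4.5];
dist2 = zeros(1, K);
for k = 1:K
    z = W{k} * (x - mu{k});                 % x* - mu_k* in the space of class k
    dist2(k) = z' * z;
end
[~, label] = min(dist2);                    % index of the closest class
```

After the transformation of class <math>\,k</math>, the data of that class has identity sample covariance, which is why a plain Euclidean distance can be used in Step 3.<br />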
<br />
<br />
Now, let us consider LDA. <br />
Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common covariance matrix across the classes, so we perform the singular value decomposition once, as opposed to <math>\,K</math> times.<br />
<br />
To apply LDA on a data set <math>\,X</math>, one can proceed as follows:<br />
<br />
:: Step 1: Apply singular value decomposition to the common covariance matrix <math>\,\Sigma</math> estimated from <math>\,X</math> to obtain <math>\,S</math> and <math>\,U</math>, where <math>\,\Sigma = U S U^\top</math>.<br />
<br />
:: Step 2: For each <math>\,x \in X</math>, transform <math>\,x</math> to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, and transform each center <math>\,\mu</math> to <math>\,\mu^* = S^{-\frac{1}{2}}U^\top \mu</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu^*</math> of each class, and assign <math>\,x</math> to the class for which this squared Euclidean distance is the least over all of the classes.<br />
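<br />
A minimal Matlab sketch of the LDA steps, again on synthetic data, is given below. The pooled covariance estimate and the names <code>Sigma</code>, <code>W</code>, <code>Z</code> and <code>accuracy</code> are assumptions made for this illustration, and equal priors are assumed for the two classes.<br />

```matlab
rng(1);                                       % fix the seed for repeatability
X1 = randn(2, 150);                           % class 1, centred near (0, 0)
X2 = randn(2, 150) + repmat([3; 1], 1, 150);  % class 2, centred near (3, 1)
mu1 = mean(X1, 2);
mu2 = mean(X2, 2);

% Step 1: pooled (common) covariance estimate and a single decomposition.
n1 = size(X1, 2);
n2 = size(X2, 2);
Sigma = ((n1 - 1) * cov(X1') + (n2 - 1) * cov(X2')) / (n1 + n2 - 2);
[U, S, ~] = svd(Sigma);
W = diag(1 ./ sqrt(diag(S))) * U';            % S^(-1/2) * U'

% Step 2: transform all points and both centres with the same W.
X = [X1, X2];
Z = W * X;

% Step 3: squared Euclidean distance to each transformed centre.
d1 = sum((Z - repmat(W * mu1, 1, size(Z, 2))).^2, 1);
d2 = sum((Z - repmat(W * mu2, 1, size(Z, 2))).^2, 1);
label = 1 + (d2 < d1);                        % 1 or 2 for each column of X
truth = [ones(1, n1), 2 * ones(1, n2)];
accuracy = mean(label == truth);              % fraction classified correctly
```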
<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often provides a better classifier than LDA because QDA does not assume that the covariance matrix is identical for every class, as LDA does. However, QDA still assumes that the class-conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method that does not rely on the Gaussian assumption.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
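<br />
As a quick numerical illustration of the two formulas, the snippet below evaluates them for one choice of <math>\,K</math> and <math>\,d</math>. The values <code>K = 3</code> and <code>d = 10</code> are arbitrary example values, not taken from the course data.<br />

```matlab
K = 3;                                    % number of classes (example value)
d = 10;                                   % number of features (example value)
nLDA = (K - 1) * (d + 1);                 % linear boundaries
nQDA = (K - 1) * (d * (d + 3) / 2 + 1);   % quadratic boundaries
fprintf('LDA: %d parameters, QDA: %d parameters\n', nLDA, nQDA);
% prints: LDA: 22 parameters, QDA: 132 parameters
```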
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA - September 28, 2010==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are functions of our original features, such as their squares. We then do LDA on our new higher-dimensional data. The linear boundary found by LDA in the higher-dimensional space can then be expressed in terms of the original features, giving us a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>. Since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra estimations make QDA less robust when there are few data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
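We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is taken to be a diagonal matrix with diagonal entries <math>\,v_1,\dots,v_d</math>, that we cannot estimate directly with a linear method.<br />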
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension. Note, however, that this is not the same as performing QDA. If we apply QDA directly to this problem, the resulting decision boundary will be different. Here we look for a nonlinear boundary in the hope of obtaining a better separation, but the method differs from the general QDA method. We can call it nonlinear LDA.<br />
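<br />
The identity behind the trick can be checked numerically. The sketch below assumes that <math>\,v</math> is diagonal, so that the quadratic term is <math>\,\sum_i v_i x_i^2</math>; the random values and the names <code>g</code>, <code>g_star</code>, <code>w_star</code> and <code>x_star</code> are introduced here only for illustration.<br />

```matlab
d = 3;
rng(2);                               % fix the seed for repeatability
x = randn(d, 1);                      % a data point
w = randn(d, 1);                      % linear coefficients
v = randn(d, 1);                      % diagonal entries v_1, ..., v_d

g      = x' * diag(v) * x + w' * x;   % quadratic function in the original space
w_star = [w; v];                      % augmented coefficient vector
x_star = [x; x.^2];                   % augmented data point
g_star = w_star' * x_star;            % linear function in the augmented space
```

The two values <code>g</code> and <code>g_star</code> agree up to rounding error, so a linear method applied to <code>x_star</code> can represent this quadratic function of <code>x</code>.<br />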
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
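<br />
:The double loop can also be written in vectorized form. The sketch below uses random numbers as a stand-in for <code>sample</code> (an assumption, so that it runs without the 2_3 data) and checks that both versions build the same matrix.<br />

```matlab
rng(7);                            % fix the seed for repeatability
sample = randn(400, 2);            % stand-in for the 400 x 2 PCA scores

% Loop version, as in the walkthrough.
X_loop = zeros(400, 4);
X_loop(:, 1:2) = sample;
for i = 1:400
    for j = 1:2
        X_loop(i, j+2) = X_loop(i, j)^2;
    end
end

% Vectorized version: append the element-wise squares as two new columns.
X_star = [sample, sample.^2];
max_diff = max(abs(X_loop(:) - X_star(:)));   % largest entry-wise difference
```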
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA to the same data set and reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the classes of the points 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
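<br />
:The same counts can be turned into an error rate directly. In the sketch below, <code>predicted</code> is a hypothetical stand-in for the <code>class</code> output of <code>classify</code>, constructed so that exactly 31 labels are wrong.<br />

```matlab
group = [ones(200, 1); 2 * ones(200, 1)];   % true labels, as in the example
predicted = group;                          % stand-in for the output of classify
predicted(1:31) = 2;                        % pretend 31 points were misclassified

n_correct  = sum(predicted == group);       % 369
error_rate = mean(predicted ~= group);      % 31/400 = 0.0775
```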
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA that do not lie on the correct side of the line produced by LDA.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learnt how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] that performs PCA conveniently. The details of this function can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The code of princomp follows, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using SVD and princomp respectively, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (possibly up to the sign of individual columns).<br />
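<br />
The relationship between the SVD inside <code>princomp</code> and the covariance matrix can also be checked without the 2_3 data. The sketch below repeats the centering and SVD calls from the listing on a small random matrix (an assumed stand-in) and compares the squared singular values with the eigenvalues of the sample covariance matrix.<br />

```matlab
rng(3);                                    % fix the seed for repeatability
A = [2 0 0 0; 0 1 0 0; 0.5 0 1 0; 0 0 0 0.1];
X = randn(50, 4) * A;                      % rows = observations, columns = variables
m = size(X, 1);

Xc = X - repmat(mean(X, 1), m, 1);         % centre each column
[~, D, V] = svd(Xc ./ sqrt(m - 1), 0);     % same economy-size call as in princomp
latent = diag(D).^2;                       % squared singular values
score  = Xc * V;                           % data in principal-component coordinates

ev = sort(eig(cov(X)), 'descend');         % eigenvalues of the sample covariance
gap = norm(latent - ev);                   % should be close to zero
```

The columns of <code>score</code> are uncorrelated, and their sample variances are the entries of <code>latent</code>.<br />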
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Principal Component Analysis - September 30, 2010==<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br />
<br /><br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA finds a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data and projects the data onto these directions to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of our data while capturing as much of the variation in the data as possible. <br />
<br />
<br />
Furthermore, if one considers the lower dimensional representation produced by PCA as a least squares fit of our original data, then it can also be easily shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know which dimensions PCA selects to be the most informative dimensions in order to create the lower-dimensional representation. <br />
<br />
<br />
Suppose <math>\,X</math> is our data matrix containing <math>\,d</math>-dimensional data. The idea behind PCA is to apply [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] to <math>\,X</math> to replace the <math>\,d</math> rows (dimensions) of <math>\,X</math> by a smaller number of rows that capture as much of the [http://en.wikipedia.org/wiki/Variance variance] in <math>\,X</math> as possible. First, through the application of singular value decomposition to <math>\,X</math>, PCA obtains all of our data's directions of variation. These directions are ordered from left to right, with the leftmost directions capturing the most variation in our data and the rightmost directions capturing the least. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space. <br />
<br />
<br />
By applying singular value decomposition to <math>\,X</math>, <math>\,X</math> is decomposed as <math>\,X = U\Sigma V^T \,</math>. The <math>\,d</math> columns of <math>\,U</math> are the [http://en.wikipedia.org/wiki/Eigenvector eigenvectors] of <math>\,XX^T \,</math>.<br />
The <math>\,d</math> columns of <math>\,V</math> are the eigenvectors of <math>\,X^TX \,</math>. The <math>\,d</math> diagonal values of <math>\,\Sigma</math> are the square roots of the [http://en.wikipedia.org/wiki/Eigenvalue eigenvalues] of <math>\,XX^T \,</math> (also of <math>\,X^TX \,</math>), and they correspond to the columns of <math>\,U</math> (also of <math>\,V</math>). <br />
<br />
<br />
We are interested in <math>\,U</math>, whose <math>\,d</math> columns are the <math>\,d</math> directions of variation of our data. Ordered from left to right, the <math>\,ith</math> column of <math>\,U</math> is the <math>\,ith</math> most informative direction of variation of our data. That is, the <math>\,ith</math> column of <math>\,U</math> is the <math>\,ith</math> most effective column in terms of capturing the total variance exhibited by our data. A subset of the columns of <math>\,U</math> is used by PCA to reduce the dimensionality of <math>\,X</math> by projecting <math>\,X</math> onto the columns of this subset. In practice, when we apply PCA to <math>\,X</math> to reduce the dimensionality of <math>\,X</math> from <math>\,d</math> to <math>\,k</math>, where <math>k < d\,</math>, we would proceed as follows:<br />
<br />
:: Step 1: Center <math>\,X</math> so that it would have zero mean.<br />
<br />
:: Step 2: Apply singular value decomposition to <math>\,X</math> to obtain <math>\,U</math>.<br />
<br />
:: Step 3: Suppose we denote the resulting <math>\,k</math>-dimensional representation of <math>\,X</math> by <math>\,Y</math>. Then, <math>\,Y</math> is obtained as <math>\,Y = U_k^TX</math>. Here, <math>\,U_k</math> consists of the first (leftmost) <math>\,k</math> columns of <math>\,U</math> that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.<br />
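<br />
The three steps can be sketched in Matlab as follows. The synthetic data and the names <code>Xc</code>, <code>Uk</code> and <code>explained</code> are assumptions made for this illustration; as in the text, each column of <math>\,X</math> is one data point.<br />

```matlab
rng(4);                                     % fix the seed for repeatability
d = 5;  n = 200;  k = 2;
X = diag([5 3 1 0.5 0.1]) * randn(d, n);    % d x n data, columns are points

Xc = X - repmat(mean(X, 2), 1, n);          % Step 1: centre each row (dimension)
[U, S, ~] = svd(Xc, 'econ');                % Step 2: columns of U are the directions
Uk = U(:, 1:k);                             % first k columns of U
Y  = Uk' * Xc;                              % Step 3: k x n representation

% Fraction of the total variance captured by the first k directions.
sv = diag(S);
explained = sum(sv(1:k).^2) / sum(sv.^2);
```

Because the singular values returned by <code>svd</code> are sorted in decreasing order, the first <math>\,k</math> columns of <math>\,U</math> are the ones that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.<br />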
<br />
<br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' principal components. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes. <br />
{{Cleanup|date=September 6 2010|reason=If anyone can tell me where I can find the 2-3 data set, I would create the new image. In the mean time, I found a non-copyrighted image of different looking 3s online, but as you can see, it is not as nice as one we could make.}}<br />
{{Cleanup|date=September 6 2010|reason=I think you can find it on your UW-ACE account for this course.}}<br />
<br />
[[File:Handwritten 3s.gif]]<br />
<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which we will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,130}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
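<br />
The factorization can be sketched in Matlab as follows. Random numbers stand in for the 130 images (an assumption, since the image data is not reproduced here), and the names <code>xbar</code>, <code>Xhat</code> and <code>rel_err</code> are introduced only for this illustration.<br />

```matlab
rng(5);                                      % fix the seed for repeatability
X = rand(644, 130);                          % stand-in: 130 images as columns
n = size(X, 2);

xbar = mean(X, 2);                           % mean image (644 x 1)
Xc = X - repmat(xbar, 1, n);                 % centred data
[U, ~, ~] = svd(Xc, 'econ');
V = U(:, 1:2);                               % 644 x 2, columns v_1 and v_2
W = V' * Xc;                                 % 2 x 130, one column of lambdas per image

Xhat = repmat(xbar, 1, n) + V * W;           % xbar + lambda_1 v_1 + lambda_2 v_2
rel_err = norm(X - Xhat, 'fro') / norm(Xc, 'fro');   % relative reconstruction error
```

For real images the relative error is typically much smaller than for random numbers, because neighbouring pixels are strongly correlated.<br />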
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[Image:23plotPCA.jpg]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
{{Cleanup|date=October 2010|reason=I think English of this section must be improved}}<br />
We want to find the direction of maximum variation. Let <math>\begin{align}\textbf{w}\end{align}</math> be an arbitrary direction, <math>\begin{align}\textbf{x}\end{align}</math> a data point and <math>\begin{align}\displaystyle u\end{align}</math> the length of the projection of <math>\begin{align}\textbf{x}\end{align}</math> in direction <math>\begin{align}\textbf{w}\end{align}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\begin{align}\textbf{w}\end{align}</math> is the same as <math>\begin{align}c\textbf{w}\end{align}</math>, for any scalar <math>c</math>, so without loss of generality, we assume that: <br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}.<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}. <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} .</math><br />
<br /><br /><br />
The above is the variance of <math>\begin{align}\displaystyle u \end{align}</math> formed by the weight vector <math>\begin{align}\textbf{w} \end{align}</math>. The first principal component is the vector <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the function. Our goal is to find the weight vector <math>\begin{align}\textbf{w} \end{align}</math> that maximizes this variability. Since the function is a convex quadratic in <math>\begin{align}\textbf{w} \end{align}</math>, it has no maximum value without a constraint: scaling <math>\begin{align}\textbf{w} \end{align}</math> by <math>c</math> scales the variance by <math>c^2</math>. Therefore we need to add a constraint that restricts the length of <math>\begin{align}\textbf{w} \end{align}</math>. Since we are only interested in the direction of the variability, we require <math>\begin{align}\textbf{w} \end{align}</math> to have unit length, so the problem becomes<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
s.t. <math>\textbf{w}^T \textbf{w} = 1.</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|.<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
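<br />
Both the identity <math>\textrm{var}(u) = \textbf{w}^T s\textbf{w}</math> and the bound can be checked numerically. The sketch below uses synthetic data and takes <math>\| s \|</math> to be the spectral norm, which is an assumption since the norm is not specified in the text; the names <code>lhs</code>, <code>rhs</code> and <code>bound</code> are introduced here for illustration.<br />

```matlab
rng(6);                               % fix the seed for repeatability
X = [1 0.8; 0 0.6] * randn(2, 500);   % 2 x 500 data, columns are points
w = [3; 4] / 5;                       % a unit-length direction

u = w' * X;                           % projection of every point onto w
s = cov(X');                          % sample covariance matrix (2 x 2)

lhs = var(u);                         % variance of the projections
rhs = w' * s * w;                     % quadratic form from the derivation
bound = norm(s);                      % spectral norm = largest eigenvalue of s
```

Here <code>lhs</code> and <code>rhs</code> agree up to rounding error, and neither exceeds <code>bound</code>.<br />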
<br />
====Lagrange Multiplier====<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve of <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one gives the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.
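
The example can be checked numerically. The following is a minimal sketch in Python (standard library only); the function names are chosen for illustration and are not from the lecture. It evaluates the partial derivatives of <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> at the two candidate points and compares the values of <math>\displaystyle f</math>.

```python
import math

def f(x, y):
    return x - y

def lagrangian_gradient(x, y, lam):
    """Partial derivatives of L = x - y - lam*(x^2 + y^2 - 1)."""
    dL_dx = 1 - 2 * lam * x
    dL_dy = -1 - 2 * lam * y
    dL_dlam = -(x ** 2 + y ** 2 - 1)
    return dL_dx, dL_dy, dL_dlam

def stationary_points():
    # dL/dx = 0 and dL/dy = 0 give x = 1/(2*lam) and y = -1/(2*lam);
    # substituting into the constraint gives lam = +/- sqrt(2)/2.
    points = []
    for lam in (math.sqrt(2) / 2, -math.sqrt(2) / 2):
        x, y = 1 / (2 * lam), -1 / (2 * lam)
        points.append((x, y, lam))
    return points

points = stationary_points()
best = max(points, key=lambda p: f(p[0], p[1]))
for x, y, lam in points:
    print(f"x={x:.4f}, y={y:.4f}, lambda={lam:.4f}, f={f(x, y):.4f}")
```

Both candidates make all three partial derivatives vanish; only the comparison of <math>\displaystyle f</math> values distinguishes the constrained maximum from the constrained minimum.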
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
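If <math>\textbf{w}</math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1</math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>.

If <math>\textbf{w}</math> is not a unit vector, the constraint is violated and the second term is nonzero. Setting <math>\displaystyle \partial L / \partial \lambda = 0</math> recovers the constraint, so every stationary point of <math>\displaystyle L(\textbf{w},\lambda)</math> satisfies <math> \textbf{w}^T \textbf{w} =1 </math>.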
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the <math>2S\textbf{w}</math> below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
{{Cleanup|date=October 2010|reason=It is good discussion, what will happen if we don't have distinct eigenvalues and eigenvectors? What does this situation mean? }}<br />
{{Cleanup|date=October 2010|reason=If the eigenvalues are not distinct, I suppose we could still take the leftmost eigenvector by default. Not sure if this is the correct approach, so can anyone please explain further? Thanks }}<br />
{{Cleanup|date=October 2010|reason= As U is the eigenvector of a symetric matrix, is it possible that we have 2 similar eigen vector? }}<br />
<br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math>
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
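
As a small numerical illustration of these identities, here is a sketch in Python (standard library only) for a 2 by 2 covariance matrix. The closed-form eigen-decomposition, the helper names, and the choice of matrix are assumptions made for this example, not part of the lecture; the sketch assumes a nonzero off-diagonal entry.

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenpairs of S = [[a, b], [b, c]] with b != 0, largest eigenvalue first."""
    if b == 0:
        raise ValueError("this sketch assumes a nonzero off-diagonal entry")
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        vx, vy = b, lam - a  # solves (a - lam)*vx + b*vy = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

def quad_form(S, w):
    """Compute w^T S w for a 2 x 2 matrix S and a 2-vector w."""
    return sum(w[i] * S[i][j] * w[j] for i in range(2) for j in range(2))

S = [[1.0, 1.5], [1.5, 3.0]]  # illustrative covariance matrix
(lam_1, u_1), (lam_2, u_2) = eig_sym_2x2(S[0][0], S[0][1], S[1][1])
print("eigenvalues:", lam_1, lam_2)
print("variance along u_1:", quad_form(S, u_1))
print("trace:", S[0][0] + S[1][1], "sum of eigenvalues:", lam_1 + lam_2)
```

The variance along each unit eigenvector equals its eigenvalue, and the two eigenvalues add up to the trace of <math>S</math>.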
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.
<br />
The Matlab code is as follows:<br />
<br />
 load noisy
 who
 size(X)
 imagesc(reshape(X(:,1),20,28)') % display the first noisy image
 colormap gray
 m_X=mean(X,2);
 mm=repmat(m_X,1,size(X,2)); % replicate the mean once per column of X
 XX=X-mm;
 [u s v] = svd(XX,'econ'); % economy-size SVD avoids forming a large square v
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components
 xHat=xHat+mm;
 figure
 imagesc(reshape(xHat(:,1),20,28)') % column 1 matches the noisy image above; any valid column index can be used
 colormap gray
<br />
Running the above code gives us 2 images. The first one shows the noisy data, in which we can barely make out the face.

The second one is the denoised image.
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Extraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized as follows (taken from the Lecture Slides).<br />
<br />
====Algorithm ====<br />
'''Recover basis:''' Calculate <math> XX^T =\Sigma_{i=1}^{n} x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.<br />
<br />
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times n</math> matrix of encoding of the original data.<br />
<br />
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.<br />
<br />
'''Encode test example:''' <math> y=U^T x </math> where <math> y </math> is a <math>d-</math>dimensional encoding of <math>x</math>.
<br />
'''Reconstruct test example:''' <math>\hat{x}= Uy=UU^Tx </math>.<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
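
The encoding and reconstruction steps above can be sketched for the smallest non-trivial case, <math>D = 2</math> and <math>d = 1</math>. This is a minimal Python illustration (standard library only); the function name <code>pca_1d</code> and the data points are hypothetical, and the eigenvector is obtained from the closed form for a symmetric 2 by 2 matrix rather than from a general eigen-solver.

```python
import math

def pca_1d(points):
    """Top principal direction, 1-D codes and reconstructions for 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]  # subtract the mean first
    # Entries of X X^T for the centered 2 x n data matrix
    a = sum(x * x for x, _ in centered)
    b = sum(x * y for x, y in centered)
    c = sum(y * y for _, y in centered)
    lam = (a + c) / 2 + math.hypot((a - c) / 2, b)  # largest eigenvalue
    if b != 0:
        ux, uy = b, lam - a
    else:
        ux, uy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(ux, uy)
    u = (ux / norm, uy / norm)
    codes = [u[0] * x + u[1] * y for x, y in centered]       # y = U^T x
    recon = [(mx + u[0] * t, my + u[1] * t) for t in codes]  # x_hat = U y, mean added back
    return u, codes, recon

# Hypothetical 2-D points lying close to a line
data = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]
u, codes, recon = pca_1d(data)
for original, approx in zip(data, recon):
    print(original, "->", tuple(round(v, 3) for v in approx))
```

Each reconstructed point is the projection of the original point onto the line through the mean in direction <math>u</math>; points that already lie exactly on a line are reconstructed without error.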
<br />
== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010 ==<br />
<br />
===Sir Ronald A. Fisher===<br />
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here]. <br />
In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
As in PCA, the goal of FDA is to project the data onto a lower-dimensional space. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf here]. PCA is an unsupervised method for dimensionality reduction, so it does not take into account the labels in the data. Suppose we have two clusters with different labels that are elongated, parallel to each other, and very close to each other. In this case, most of the total variation of the data is along the direction of the two clusters. If we use PCA in a case like this, both clusters are projected onto the direction of greatest variation and overlap to form what looks like a single cluster after projection. PCA would therefore mix up two clusters that, in fact, have different labels. What we need to do instead, in cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation, i.e. the direction of least variation of the data. In the 1-dimensional space resulting from such a projection, we would be able to classify the data effectively, because the two clusters would be perfectly or nearly perfectly separated according to their labels. This is exactly the idea behind FDA. 
<br />
The main difference between FDA and PCA is that, in FDA, in contrast to PCA, we are not interested in retaining as much of the variance of our original data as possible. Rather, in FDA, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction that is most representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
Suppose we have 2-dimensional data, then FDA would attempt to project the data of each class onto a point in such a way that the resulting two points would be as far apart from each other as possible. Intuitively, this basic idea behind FDA is the optimal way for separating each pair of classes along a certain direction. <br />
<br />
{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).}}<br />
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}<br />
{{Cleanup|date=October 2010|reason=An extension of clustering is subspace clustering in which different subspace are searched through to find the relavant and appropriate dimentions. High dimentional data sets are roughly equiedistant from each other, so feature selection methods are used to remove the irrelavant dimentions. These techniques do not keep the relative distance so PCA is not useful for these applications. It should be noted that subspace clustering localize their search unlike feature selection algorithms.for more information click here[http://portal.acm.org/citation.cfm?id=1007731]}}<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2-classes problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k-classes problem, we want to reduce the data to k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
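
To make the equivalence between a Mahalanobis metric and a linear transformation concrete, suppose the positive semi-definite matrix defining the metric is written as <math>A = WW^T</math> (the symbols <math>A</math> and <math>W</math> are introduced here for illustration only). Then

<math>\displaystyle d_A(\underline{x},\underline{y})^2 = (\underline{x}-\underline{y})^T A (\underline{x}-\underline{y}) = (\underline{x}-\underline{y})^T W W^T (\underline{x}-\underline{y}) = ||W^T\underline{x} - W^T\underline{y}||^2</math>

so the Mahalanobis distance under <math>A</math> is the Euclidean distance after mapping each point <math>\underline{x}</math> to <math>W^T\underline{x}</math>. Choosing <math>A = I</math> recovers the ordinary Euclidean distance.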
<br />
{{Cleanup|date=October2010|reason=Anyone please add an example to make the comparison clearer}}<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs better than or as well as some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.
<br />
===FDA Goals===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
==== Example in R ====<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
 >> library(MASS)  # provides mvrnorm and lda
 >> X = matrix(nrow=400,ncol=2)
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
 >> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)
: Calculate the singular value decomposition of the mean-centered X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.

 >> s2 <- lda(X,grouping=Y)
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
<br />
FDA projects the data into lower dimensional space, where the distances between the projected means are maximum and the within-class variances are minimum. There are two categories of classification problems:<br />
<br />
1. Two-class problem<br />
<br />
2. Multi-class problem (addressed next lecture)<br />
<br />
=== Two-class problem ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. 
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within each class''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances (the sum of the two covariance matrices is itself a valid covariance matrix, satisfying the symmetry and positive semi-definiteness criteria). 
<br />
{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}<br />
<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline{x_1}, \underline{x_2}, \cdots, \underline{x_n} \} </math>
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> where <math>\ z_i </math> is a scalar<br />
<br />
====1. Minimizing within-class variance==== <br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_1\underline{w}) </math>

<math>\displaystyle \min_w (\underline{w}^T\Sigma_2\underline{w}) </math>

and this problem reduces to <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math>
<br> (where <math>\,\Sigma_1</math> and <math>\,\Sigma_2 </math> are the covariance matrices of the 1st and 2nd classes of data, respectively) 

Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-classes covariance.
Then, this problem can be rewritten as <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>.
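
The claim that the variance of the projected data equals the quadratic form in the covariance matrix can be checked on a toy data set. The following Python sketch (standard library only) uses made-up points, a made-up direction <code>w</code>, and the <math>1/n</math> normalisation of the covariance; all names are chosen for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def covariance_2d(points):
    """Covariance matrix (1/n normalisation) of a list of 2-D points."""
    n = len(points)
    mx, my = mean([p[0] for p in points]), mean([p[1] for p in points])
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    return [[sxx, sxy], [sxy, syy]]

def quad_form(S, w):
    """Compute w^T S w for a 2 x 2 matrix S and a 2-vector w."""
    return sum(w[i] * S[i][j] * w[j] for i in range(2) for j in range(2))

def projected_variance(points, w):
    """Variance (1/n normalisation) of the scalars z_i = w^T x_i."""
    z = [w[0] * x + w[1] * y for x, y in points]
    mz = mean(z)
    return sum((t - mz) ** 2 for t in z) / len(z)

# Hypothetical toy classes and projection direction
class_1 = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (1.0, 1.0)]
class_2 = [(4.0, 3.0), (5.0, 5.0), (6.0, 4.0), (5.0, 4.0)]
w = (0.6, -0.8)

sigma_1, sigma_2 = covariance_2d(class_1), covariance_2d(class_2)
s_w = [[sigma_1[i][j] + sigma_2[i][j] for j in range(2)] for i in range(2)]
within = projected_variance(class_1, w) + projected_variance(class_2, w)
print(within, quad_form(s_w, w))
```

The two printed numbers agree up to rounding, which is why minimizing the within-class variance after projection is the same as minimizing <math>\underline{w}^Ts_w\underline{w}</math>.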
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br /> <br />
<math>\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2, </math><br />
<br /><br /><br />
<math>\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\
&= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\
&= (\mu_1 - \mu_2)^T \underline{w} \underline{w}^T (\mu_1 - \mu_2) \\
&= ((\mu_1 - \mu_2)^T \underline{w})^{T} (\underline{w}^T (\mu_1 - \mu_2))^{T} \\
&= \underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \underline{w} \end{align}</math><br /><br />

Note that the last two lines use the fact that <math>(\mu_1 - \mu_2)^T \underline{w}</math> and <math>\underline{w}^T (\mu_1 - \mu_2)</math> are scalars: each equals its own transpose, so the factors can be transposed without changing the value.
<br />
Let <math>\displaystyle s_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math>, the between-class covariance, then the goal is to <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>.<br />
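
A quick numerical check of this identity, as a Python sketch (standard library only): the means are borrowed from the examples in these notes, while the direction <code>w</code> and the helper names are arbitrary choices made for illustration.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def quad_form(S, w):
    """Compute w^T S w for a 2 x 2 matrix S and a 2-vector w."""
    return sum(w[i] * S[i][j] * w[j] for i in range(2) for j in range(2))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

mu_1, mu_2 = (1.0, 1.0), (5.0, 3.0)
diff = tuple(a - b for a, b in zip(mu_1, mu_2))
s_B = outer(diff, diff)  # (mu_1 - mu_2)(mu_1 - mu_2)^T

w = (0.6, -0.8)  # arbitrary projection direction
squared_gap = (dot(w, mu_1) - dot(w, mu_2)) ** 2
print(squared_gap, quad_form(s_B, w))
```

Both expressions give the squared distance between the projected means, so maximizing one is the same as maximizing the other.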
<br />
===The Objective Function for FDA===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math> or <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>
# <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math> 
<br /><br />
So, we construct our objective function as maximizing the ratio of the two quantities stated above:<br />
<br />
<math>\displaystyle \max_w \frac{(\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w})} {(\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})} </math>
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max_w \frac{(\underline{w}^Ts_B\underline{w})}{(\underline{w}^Ts_w\underline{w})}</math> <br /><br />
One may argue that we could instead subtract the within-class term from the between-class term. This is possible, but it requires an additional scaling factor to weigh the two terms against each other, so the ratio is more convenient.

The value of this ratio does not change when <math>\underline{w}</math> is rescaled, so the maximizer is determined only up to a constant factor. To fix the scale, we add the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> and maximize the numerator <math>\underline{w}^Ts_B\underline{w}</math>. To solve this optimization problem we form the Lagrangian:
<br />
<br /><br /><br />
<math>\displaystyle L(\underline{w},\lambda) = \underline{w}^Ts_B\underline{w} - \lambda (\underline{w}^Ts_w\underline{w} -1)</math><br /><br /><br />
<br />
<br /><br />
Then, we set the partial derivative of L with respect to <math>\underline{w}</math> equal to zero:
<math>\displaystyle \frac{\partial L}{\partial \underline{w}}=2s_B \underline{w} - 2\lambda s_w \underline{w} = 0 </math> <br /><br />
<br />
<math>s_B \underline{w} = \lambda s_w \underline{w}</math><br /><br />
<math>s_w^{-1}s_B \underline{w}= \lambda\underline{w}</math><br /><br /><br />
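This is a generalized eigenvalue problem. At a stationary point the objective equals <math>\underline{w}^Ts_B\underline{w} = \lambda \underline{w}^Ts_w\underline{w} = \lambda</math>, so <math> \underline{w}</math> is the eigenvector of <math>s_w^{-1}s_B </math> corresponding to the largest eigenvalue.<br />

This solution can be further simplified as follows:<br />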
<br />
<math>s_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w} = \lambda\underline{w} </math><br /><br />
<br />
Since <math>(\mu_1 - \mu_2)^T\underline{w}</math> is a scalar then <math>s_w^{-1}(\mu_1 - \mu_2)</math>∝<math>\underline{w}</math> <br /><br /><br />
This gives the direction of <math>\underline{w}</math> without doing eigenvalue decomposition in the case of 2-class problem.<br />
<br />
Note: In order for <math>{s_w}</math> to have an inverse, it must have full rank. A necessary (but not sufficient) condition for this is that the number of data points is <math>\,\ge</math> the dimensionality of <math>\underline{x_{i}}</math>.
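
The closed-form direction can be verified on a small case. The sketch below is written in Python (standard library only) and borrows its means and covariance from the examples in these notes; the helper names are hypothetical. It computes <math>s_w^{-1}(\mu_1 - \mu_2)</math>, checks that it is an eigenvector of <math>s_w^{-1}s_B</math>, and compares the value of the ratio with the values obtained for a few other directions.

```python
def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("s_w is singular, so the closed form does not apply")
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    return tuple(sum(m[i][j] * v[j] for j in range(2)) for i in range(2))

def quad_form(S, w):
    """Compute w^T S w for a 2 x 2 matrix S and a 2-vector w."""
    return sum(w[i] * S[i][j] * w[j] for i in range(2) for j in range(2))

def fisher_ratio(w, s_B, s_w):
    """Objective (w^T s_B w) / (w^T s_w w)."""
    return quad_form(s_B, w) / quad_form(s_w, w)

mu_1, mu_2 = (1.0, 1.0), (5.0, 3.0)
sigma = [[1.0, 1.5], [1.5, 3.0]]                   # shared class covariance
s_w = [[2 * sigma[i][j] for j in range(2)] for i in range(2)]
diff = (mu_1[0] - mu_2[0], mu_1[1] - mu_2[1])
s_B = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

w = mat_vec(inverse_2x2(s_w), diff)                # proportional to the FDA direction
image = mat_vec(inverse_2x2(s_w), mat_vec(s_B, w)) # s_w^{-1} s_B w
lam = fisher_ratio(w, s_B, s_w)

print("w =", w)
print("s_w^{-1} s_B w =", image, " lambda * w =", (lam * w[0], lam * w[1]))
for other in [(1.0, 0.0), (0.0, 1.0), diff]:
    print("ratio for", other, "=", fisher_ratio(other, s_B, s_w), "<=", lam)
```

Rescaling <code>w</code> leaves the ratio unchanged, which is why only the direction matters. The direction joining the two means (<code>diff</code>) gives a noticeably smaller ratio than <code>w</code> here, because it ignores the shape of the within-class covariance.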
<br />
===FDA Using Matlab===<br />
Note: ''The following example was not actually mentioned in this lecture''<br />
<br />
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods. <br />
 %First of all, we generate the two data sets:
% First data set X1<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);<br />
%In this case: <br />
mu_1=[1;1]; <br />
Sigma_1=[1 1.5; 1.5 3]; <br />
%where mu and sigma are the mean and covariance matrix.<br />
% Second data set X2<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300); <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
plot(X1(:,1),X1(:,2),'.b'); hold on;<br />
plot(X2(:,1),X2(:,2),'ob')<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
% Combine data sets to map both into the same subspace<br />
 X=[X1;X2];
 X=X';
 % We use the built-in PCA function in Matlab; princomp expects one observation per row, so we pass X'
 [coefs, scores]=princomp(X');
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is very little overlap
 Xp=coefs(:,1)'*X
 figure
 plot(Xp(1:300),1,'ob') % a marker is needed for the discrete points to be visible
 hold on
 plot(Xp(301:600),2,'or') 
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is overlap, since we project onto the first principal component [Xp=coefs(:,1)'*X]
<br />
===Some of FDA applications===<br />
FDA has applications in many domains; some of them are listed below:
<br />
* Speech/Music/Noise Classification in Hearing Aids
 FDA can be used to enhance listening comprehension when the user moves from one sound
 environment to a different one. For more information, see the paper by Alexandre et al. [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here].
<br />
* Application to Face Recognition<br />
 FDA can be used for face recognition in different situations. Using FDA, Kong et al. propose an application to face
 recognition with a small number of training samples [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].
<br />
* Palmprint Recognition<br />
FDA is used in biometrics, to implement an automated palmprint recognition system. See An Automated Palmprint Recognition System by Tee et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here].<br />
<br />
{{Cleanup|date=October 2010|reason=I think briefing about the other applications would be easier than browsing through all of these applications}}<br />
<br />
{{Cleanup|date=October 2010|reason= This link is no longer valid.}}<br />
<br />
Other applications can be found in references 4, 5, 6, 7 and 8, and more [http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=1489148820&_sort=r&_st=13&view=c&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=f210273546a659c90ae0962fce7b8b4e&searchtype=a here].
<br />
=== '''References'''===<br />
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; , "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , vol.2, no., pp. 1083- 1088 vol. 2, 20-25 June 2005<br />
doi: 10.1109/CVPR.2005.30<br />
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]<br />
<br />
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS USING A TWO-LAYER CLASSIFICATION SYSTEM WITH MSE LINEAR DISCRIMINANTS", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]
<br />
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]<br />
<br />
4. met, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.<br />
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]<br />
<br />
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"<br />
Journal of Computers & Chemical Engineering, 2004<br />
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]<br />
<br />
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004<br />
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]<br />
<br />
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]<br />
<br />
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal of Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]<br />
<br />
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010==<br />
<br />
====Obtaining Covariance Matrices====<br />
<br />
<br />
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{i}<br />
\end{align}<br />
</math><br />
<br />
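where <math>\mathbf{S}_{i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>\,i</math> (left unnormalized so that the decomposition derived below holds exactly) and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math> is the mean of class <math>\,i</math>.<br />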
<br />
However, the between-class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total covariance <math>\,\mathbf{S}_{T}</math> of a given set of data is constant and known, and we can also decompose this variance into two parts: the within-class variance <math>\mathbf{S}_{W}</math> and the between-class variance <math>\mathbf{S}_{B}</math> in a way that is similar to [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA]. We thus have:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
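where the total scatter matrix (the total covariance without its <math>\frac{1}{n}</math> factor, matching the unnormalized sums used in the derivation below) is given by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = <br />
\sum_{i=1}^{n}(\mathbf{x}_{i}-\mathbf{\mu})(\mathbf{x}_{i}-\mathbf{\mu})^{T}<br />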
\end{align}<br />
</math><br />
<br />
We can now get <math>\mathbf{S}_{B}</math> from the relationship: <br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
<br />
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Suppose the data contains <math>\, k </math> classes, and each class <math>\, j </math> contains <math>\, n_{j} </math> data points. We denote the overall mean vector by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total scatter matrix <math>\mathbf{S}_{T}</math>, written without the <math>\frac{1}{n}</math> factor, is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i=1}^{n}(\mathbf{x}_{i}-\mathbf{\mu})(\mathbf{x}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
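Writing <math>\mathbf{x}_{j}-\mathbf{\mu} = (\mathbf{x}_{j}-\mathbf{\mu}_{i}) + (\mathbf{\mu}_{i}-\mathbf{\mu})</math> and expanding, the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = 0</math> within every class, so we obtain<br />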
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within-class covariance <math>\mathbf{S}_{W}</math><br />
and the between-class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term in the final line of the derivation above as the between-class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
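The decomposition can be checked numerically. The following is a minimal sketch, assuming NumPy and unnormalized scatter matrices (no <math>\frac{1}{n}</math> factors), as in the derivation above. The function name <code>scatter_matrices</code> and the random test data are made up for illustration, and the rows of <code>X</code> are data points, i.e. the transpose of the column convention used in the text.<br />

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_W, S_B, S_T) for data X of shape (n, d) and class labels y."""
    mu = X.mean(axis=0)                      # overall mean
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]                       # points of class c
        mu_c = Xc.mean(axis=0)               # class mean mu_i
        D = Xc - mu_c
        S_W += D.T @ D                       # sum_j (x_j - mu_i)(x_j - mu_i)^T
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * (diff @ diff.T)     # n_i (mu_i - mu)(mu_i - mu)^T
    T = X - mu
    S_T = T.T @ T                            # total scatter
    return S_W, S_B, S_T

# Hypothetical data: three classes of unequal size with shifted means.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [10, 20, 30])
X = rng.normal(size=(60, 4)) + y[:, None]
S_W, S_B, S_T = scatter_matrices(X, y)
print(np.allclose(S_T, S_W + S_B))
```

Computing <math>\mathbf{S}_{B}</math> directly from the class means in this way avoids forming <math>\mathbf{S}_{T}</math> at all.<br />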
<br />
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^* =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})^{T})<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
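The first expression is the two-class definition and the second is the multi-class definition with <math>k=2</math>; they agree up to a constant factor. Since <math>\mathbf{\mu} = (n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2})/n</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, so <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B}^*</math> and both lead to the same discriminant direction.<br />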
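The relation between the two expressions can be made concrete: with <math>k=2</math> the multi-class matrix equals <math>\frac{n_{1}n_{2}}{n}</math> times the two-class one. The sketch below checks this numerically, assuming NumPy; the class sizes and the random data are made up for illustration.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 15, 35                                   # hypothetical class sizes
X1 = rng.normal(loc=0.0, size=(n1, 3))            # rows are data points
X2 = rng.normal(loc=2.0, size=(n2, 3))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
n = n1 + n2
mu = (n1 * mu1 + n2 * mu2) / n                    # overall mean

S_B_two = np.outer(mu1 - mu2, mu1 - mu2)          # two-class definition
S_B_multi = (n1 * np.outer(mu1 - mu, mu1 - mu)
             + n2 * np.outer(mu2 - mu, mu2 - mu)) # multi-class definition, k = 2
print(np.allclose(S_B_multi, (n1 * n2 / n) * S_B_two))
```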
<br />
Now we look for the optimal transformation. Each data point is projected as<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))((\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W})<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the following as our measure:<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of the eigenvalue problem<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Ordering the eigenvalues by size is well defined: whenever <math>\mathbf{S}_{W}</math> is positive definite, <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> is similar to the symmetric positive semi-definite matrix <math>\mathbf{S}_{W}^{-1/2}\mathbf{S}_{B}\mathbf{S}_{W}^{-1/2}</math>, so its eigenvalues are real and non-negative and complex eigenvalues cannot occur.<br />
<br />
<br />
Recall that the squared Frobenius norm of <math>\mathbf{X}</math> is <br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
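The trace identity above is easy to confirm numerically. The following sketch, assuming NumPy, compares the weighted sum of squared norms with <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math>; the class counts, class means and the matrix <code>W</code> are arbitrary made-up values.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 5, 3
counts = np.array([10, 20, 30])                  # hypothetical class sizes n_i
means = rng.normal(size=(k, d))                  # rows are class means mu_i
mu = counts @ means / counts.sum()               # overall mean
W = rng.normal(size=(d, k - 1))                  # any d x (k-1) matrix

diffs = means - mu                               # rows are mu_i - mu
S_B = (counts[:, None] * diffs).T @ diffs        # sum_i n_i (mu_i-mu)(mu_i-mu)^T

lhs = sum(n_i * np.linalg.norm(W.T @ dv) ** 2 for n_i, dv in zip(counts, diffs))
rhs = np.trace(W.T @ S_B @ W)
print(np.isclose(lhs, rhs))
```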
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following classic criterion function that Fisher used<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
As in the two-class problem, we want to solve:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem we introduce Lagrange multipliers. Fixing the scale of each column <math>\mathbf{w}_{i}</math> of <math>\mathbf{W}</math> separately, <math>\mathbf{w}_{i}^{T}\mathbf{S}_{W}\mathbf{w}_{i} = c_{i}</math> (the constants <math>c_{i}</math> sum to 1 under the trace constraint and do not affect the solution), gives one multiplier <math>\lambda_{i}</math> per column. We collect them in a <math>(k-1) \times (k-1)</math> diagonal matrix <math>\mathbf{\Lambda}</math>:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\mathbf{\Lambda}) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \sum_{i=1}^{k-1}\lambda_{i}\left( \mathbf{w}_{i}^{T}\mathbf{S}_{W}\mathbf{w}_{i} - c_{i} \right)<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\mathbf{\Lambda}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, so setting the first derivative to zero gives:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \mathbf{S}_{W}\mathbf{W}\mathbf{\Lambda}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \mathbf{S}_{W}\mathbf{W}\mathbf{\Lambda}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>. Column by column this reads <math>\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{S}_{W}\mathbf{w}_{i}</math>.<br />
<br />
In fact, <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) = rank(\mathbf{S}_{B}) \le k-1</math>: <math>\mathbf{S}_{B}</math> is built from the <math>k</math> vectors <math>\mathbf{\mu}_{i}-\mathbf{\mu}</math>, which satisfy <math>\sum_{i=1}^{k} n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu}) = 0</math>.<br />
<br />
Therefore, the solution is the same as the one stated earlier. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
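The eigenvalue problem above can be solved numerically as follows. This is a minimal sketch, assuming NumPy and a positive definite <math>\mathbf{S}_{W}</math>. Rather than forming <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> explicitly, it uses a Cholesky factor <math>\mathbf{S}_{W} = \mathbf{L}\mathbf{L}^{T}</math> to obtain an equivalent symmetric eigenproblem, which is a substituted but mathematically equivalent route that keeps all eigenvalues real. The function name <code>fda_transform</code>, the class centres and the data are hypothetical, and the rows of <code>X</code> are data points.<br />

```python
import numpy as np

def fda_transform(X, y):
    """Return (W, eigenvalues, S_W, S_B) for data X (n x d) and labels y."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    L = np.linalg.cholesky(S_W)                        # S_W = L L^T
    M = np.linalg.solve(L, np.linalg.solve(L, S_B).T)  # L^{-1} S_B L^{-T}
    M = (M + M.T) / 2                                  # remove round-off asymmetry
    evals, V = np.linalg.eigh(M)                       # real eigenvalues, ascending
    top = np.argsort(evals)[::-1][: len(classes) - 1]  # k-1 largest
    W = np.linalg.solve(L.T, V[:, top])                # w_i = L^{-T} v_i
    return W, evals[top], S_W, S_B

# Hypothetical data: three classes in four dimensions.
rng = np.random.default_rng(4)
centers = np.array([[0.0, 0, 0, 0], [3.0, 0, 0, 0], [0.0, 3, 0, 0]])
y = np.repeat([0, 1, 2], 30)
X = centers[y] + rng.normal(size=(90, 4))
W, evals, S_W, S_B = fda_transform(X, y)
Z = X @ W          # projected data, one (k-1)-dimensional row per point
```

Each returned column satisfies <math>\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{S}_{W}\mathbf{w}_{i}</math>, which is the same condition as the eigenvalue equation above multiplied through by <math>\mathbf{S}_{W}</math>.<br />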
<br />
{{Cleanup|date=October 2010|reason=Adding more general comments about the advantages and flaws of FDA would be effective here.}}<br />
<br />
{{Cleanup|date=October 2010|reason=Would you please show how we could reconstruct our original data from the data whose dimensionality has been reduced by FDA.}}<br />
{{Cleanup|date=October 2010|reason= When you reduce the dimensionality of data, in the most general case you lose some features of the data and cannot reconstruct the data from the reduced space, unless the data have special properties that help in reconstruction, such as sparsity. In FDA it seems that we cannot, in general, reconstruct the data from the reduced version of the data. }}<br />
<br />
===Generalization of Fisher's Linear Discriminant Analysis ===<br />
<br />
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis, partly because of its simplicity<br />
and because it does not require strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be badly degraded by only a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. There is therefore a need for robust procedures that can accommodate outliers without being strongly affected by them. A generalization of Fisher's linear discriminant algorithm [http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf] has been developed that leads to such a robust procedure.<br />
<br />
Also notice that LDA can be seen as a dimensionality reduction technique. In a general k-class problem, the k class means lie in an affine subspace of dimension at most k-1. Given a data point, we are looking for the class mean closest to it. In LDA, we project the data point onto this subspace and calculate distances within it. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimensionality from d dimensions to k - 1 dimensions.<br />
<br />
==Linear and Logistic Regression - October 12, 2010==<br />
<br />
===Linear Regression===<br />
Linear regression is an approach for modeling a scalar response <math>\, y</math> as a function of a set of explanatory (independent) variables <math>\,X</math>. The goal is to find an appropriate set of explanatory variables for <math>\, y</math> and to estimate its value from them. In classification, by contrast, the goal is to assign data to groups such that members of the same group are more similar to each other than to members of other groups.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
According to Bayes Classification we estimate the posterior as,<br/><br />
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{l}f_{l}(x)\pi_{l}}</math><br/><br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
y_i = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
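and, appending a constant 1 to each input so that the intercept <math>\,\beta_0</math> is absorbed into <math>\,\beta</math>, we can write it compactly as<br />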
:<math><br />
\begin{align}<br />
\mathbf{y} = \beta^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
where <math>\,\beta^{T} = (<br />
\beta_1,..., \beta_{d},\beta_0)</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=<br />
\begin{pmatrix}<br />
\mathbf{x}_{1}, \dots,\mathbf{x}_{n}\\<br />
1, \dots, 1<br />
\end{pmatrix}<br />
</math> is a <math>(d+1) \times n</math> matrix. Here <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector and <math>\mathbf{y} = (y_{1}, \dots, y_{n})</math> is a <math>1 \times n</math> row vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find the <math>\,\beta^{T}</math> for which the linear model best fits the data, by minimizing the sum of squared errors using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
We then try to minimize the residual sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\beta^{T}\mathbf{X})(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}(\mathbf{y}^{T}-\mathbf{X}^{T}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}^{T}<br />
\end{align}<br />
</math><br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \hat\beta^{T}\mathbf{X} = <br />
\mathbf{y}\mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X} </math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].<br />
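The closed-form solution can be sketched in a few lines, assuming NumPy and the document's convention that <math>\mathbf{X}</math> is <math>(d+1) \times n</math> with a last row of ones and <math>\mathbf{y}</math> is a row vector. The coefficients in <code>beta_true</code> and the noise level are made-up values used only to generate test data.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
pts = rng.normal(size=(d, n))                    # columns are the inputs x_i
X = np.vstack([pts, np.ones((1, n))])            # (d+1) x n, last row of ones
beta_true = np.array([1.5, -2.0, 0.5, 4.0])      # (beta_1..beta_d, beta_0), hypothetical
y = beta_true @ X + 0.1 * rng.normal(size=n)     # responses as a length-n row

# beta_hat = (X X^T)^{-1} X y^T, computed with a linear solve instead of an inverse.
beta_hat = np.linalg.solve(X @ X.T, X @ y)
H = X.T @ np.linalg.solve(X @ X.T, X)            # hat matrix, n x n
y_hat = y @ H                                    # fitted values, same as beta_hat^T X
```

Using <code>np.linalg.solve</code> rather than an explicit inverse is a numerical design choice; the result is the same whenever <math>\mathbf{X}\mathbf{X}^{T}</math> is invertible.<br />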
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{l}f_{l}(x)\pi_{l}}</math><br/><br />
To make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. For a two-class problem with <math>\ Y \in \{0,1\}</math>, the regression function is itself the posterior probability, since <math>\ E(Y|X=x) = 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x) = P(Y=1|X=x) </math>. <br />
Estimating <math>\displaystyle r(x)=E(Y|X=x)</math> by regression is therefore a more direct approach to classification, because it does not need to estimate <math>\ f_k(x) </math> and <math>\ \pi_k </math>. However, if <math>\mathbf{\hat\beta} </math> is learned as above, nothing restricts the fitted <math>\displaystyle r(x)</math> to values between 0 and 1.<br />
<br />
The linear model is therefore not a valid model of the posterior, although at times it can still lead to a decent classifier, for example with the targets coded as <math>\ y_i=\frac{1}{n_1} </math> for one class and <math>\ y_i=\frac{-1}{n_2} </math> for the other.<br />
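A small made-up example illustrates both points of the note: a least-squares fit to 0/1 labels produces fitted values outside [0,1], yet thresholding them at 0.5 can still classify well. This is a sketch assuming NumPy and one-dimensional inputs chosen purely for illustration.<br />

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 60)                 # one-dimensional inputs (none equal to 0)
y = (x > 0).astype(float)                      # label 1 for x > 0, label 0 otherwise
X = np.vstack([x, np.ones_like(x)])            # 2 x n, last row of ones

beta_hat = np.linalg.solve(X @ X.T, X @ y)     # least-squares fit to the 0/1 labels
r_hat = beta_hat @ X                           # fitted "probabilities"

print(r_hat.min(), r_hat.max())                # values below 0 and above 1 occur
predicted = r_hat > 0.5                        # classify by thresholding at 1/2
```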
<br />
===Logistic Regression===<br />
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood <math>\displaystyle Pr(Y|X)</math>. Since <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications with two classes, for instance: people survive or die, have a disease or not, have a risk factor or not.<br />
<br />
====Logistic Function====<br />
[[File:200px-Logistic-curve.svg.png | Logistic Sigmoid Function]]<br />
<br />
<br />
<br />
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common sigmoid curve. <br />
<br />
1. <math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
3. <math>y(0) = \frac{1}{2}</math><br />
<br />
4. <math> \int y\, dx = \ln(1 + e^{x}) + C</math><br />
<br />
5. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
The logistic curve shows early exponential growth for negative x, which slows to linear growth of slope 1/4 near x = 0, and then approaches y = 1 with an exponentially decaying gap.<br />
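The listed properties can be checked numerically. The sketch below, assuming NumPy, compares the derivative <math>y(1-y)</math> with a finite-difference slope and compares the series expansion with the function near 0; note that the <math>x^{5}</math> coefficient of the series is <math>\frac{1}{480}</math>. The helper name <code>logistic</code> and the grid values are illustrative choices.<br />

```python
import numpy as np

def logistic(x):
    """Logistic sigmoid y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
h = 1e-5
numeric_slope = (logistic(x + h) - logistic(x - h)) / (2 * h)   # central difference
analytic_slope = logistic(x) * (1.0 - logistic(x))              # property 2: y(1 - y)

small = np.linspace(-0.5, 0.5, 11)
taylor = 0.5 + small / 4 - small**3 / 48 + small**5 / 480       # series up to x^5
```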
<br />
====Intuition behind Logistic Regression====<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
====The Logistic Regression Model====<br />
<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
{{Cleanup|date=October 18 2010|reason=I could not find any source for these graphs. However, they follow the definition of the defined probability. I don't think the graph as generated here is copyrighted, but if you are worried you can draw the figure yourself by applying the function and posting the result.}}<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
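As a minimal sketch of the two-class model, assuming NumPy, the following computes both class probabilities for a single input and confirms that they sum to one and that the log odds are linear in <math>\underline{x}</math>. The coefficient vector and the input are made-up values, and the function name <code>class_probabilities</code> is hypothetical.<br />

```python
import numpy as np

def class_probabilities(beta, x):
    """Return (P(Y=1|x), P(Y=0|x)) under the two-class logistic model."""
    z = beta @ x                               # linear predictor beta^T x
    p1 = np.exp(z) / (1.0 + np.exp(z))
    p0 = 1.0 / (1.0 + np.exp(z))
    return p1, p0

beta = np.array([0.8, -1.2, 0.3])              # hypothetical coefficients
x = np.array([1.0, 0.5, 1.0])                  # hypothetical input, last entry 1 for the intercept
p1, p0 = class_probabilities(beta, x)
log_odds = np.log(p1 / p0)                     # recovers beta^T x
```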
<br />
====Fitting a Logistic Regression====<br />
Logistic regression tries to fit a distribution. The common practice in statistics is to fit a density function, here the posterior probability of each class <math>\displaystyle Pr(Y|X)</math>, to the data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the observed data <math>\displaystyle{x_{1},...,x_{n}}</math> with their labels <math>\displaystyle{y_{1},...,y_{n}}</math> under the assumed model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence and identical distribution)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
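The simplification above can be verified numerically. This sketch, assuming NumPy, evaluates the log-likelihood both from the first line of the derivation and from its final simplified form. The function names, the random data and the value of <code>beta</code> are hypothetical, and <code>X</code> follows the <math>(d+1) \times n</math> convention with a last row of ones.<br />

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Simplified form: sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    z = beta @ X
    return np.sum(y * z - np.logaddexp(0.0, z))   # logaddexp(0, z) = log(1 + e^z)

def log_likelihood_direct(beta, X, y):
    """First line of the derivation: sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    p = 1.0 / (1.0 + np.exp(-(beta @ X)))
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(size=(2, 30)), np.ones((1, 30))])
y = (rng.random(30) < 0.5).astype(float)
beta = np.array([0.4, -0.9, 0.2])
print(np.isclose(log_likelihood(beta, X, y), log_likelihood_direct(beta, X, y)))
```

The simplified form is also the numerically safer one, since <code>np.logaddexp</code> avoids overflow for large <math>\underline{\beta}^T\underline{x_i}</math>.<br />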
<br />
<br />
{{Cleanup|date=October 13 2010|reason=I think, in the following, y_i * x_i and the single x_i on the right side should both be transposed by matrix calculus?}}<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math> <br />
<br />
These are <math>\ d+1 </math> nonlinear equations in <math> \underline{\beta} </math>. Because every <math>\underline{x}_i</math> contains a constant component 1 for the intercept, the corresponding equation reads <math>\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta}) </math>, i.e. the expected number of class ones matches the observed number.<br />
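The derivative above can be sanity-checked against finite differences. The following sketch assumes NumPy; the function names, the random data and the point <code>beta</code> at which the gradient is evaluated are made up for illustration.<br />

```python
import numpy as np

def log_likelihood(beta, X, y):
    z = beta @ X
    return np.sum(y * z - np.logaddexp(0.0, z))

def score(beta, X, y):
    """Gradient of l: sum_i (y_i - p(x_i; beta)) x_i, a vector of length d+1."""
    p = 1.0 / (1.0 + np.exp(-(beta @ X)))
    return X @ (y - p)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(size=(2, 40)), np.ones((1, 40))])   # (d+1) x n, d = 2
y = (rng.random(40) < 0.5).astype(float)
beta = np.array([0.3, -0.7, 0.1])

eps = 1e-6
numeric = np.array([
    (log_likelihood(beta + eps * e, X, y) - log_likelihood(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)                                        # one unit vector per coordinate
])
```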
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
====Extension====<br />
<br />
* When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model].<br />
<br />
* Limitations of Logistic Regression:<br />
:1. No assumptions are made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another, because this could cause problems with estimation.<br />
:2. A large number of data points (i.e. a large sample size) is required for logistic regression to have sufficient numbers in both classes. The more features/dimensions the data has, the larger the sample size required.<br />
<br />
==Lecture summary==<br />
{{Cleanup|date=October 18 2010|reason=Can anybody provide a better lecture summary? The one below is to just get it started}}<br />
In this lecture, linear regression was introduced and the logistic model for the posterior probabilities of a two-class problem was defined. Maximum likelihood was used to estimate the model parameters (i.e. to fit the logistic model to the data).<br />
<br />
== Logistic Regression Cont. - October 14, 2010 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Estimating Parameters <math>\underline{\beta}</math> ===<br />
<br />
'''Criterion''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
'''Newton-Raphson Algorithm:'''<br /><br />
<br />
Suppose we want to find <math>\ x^* </math> such that <math>\ f(x^*)=0</math>.<br />
<br />
We first pick a starting point <math>\ x^{old}</math> and then iterate:<br />
<br /><br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f(x^{old})}{f'(x^{old})} </math> <br /><br />
<math> \ x^{old} \leftarrow x^{new}</math> <br />
<br /><br />
This is repeated until convergence. <br />
<br />
If we want to maximize or minimize <math>\ f(x) </math>, we instead solve <math>\ f'(x)=0 </math>, for which the update becomes<br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f'(x^{old})}{f''(x^{old})} </math><br />
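The scalar iteration can be sketched as follows in plain Python. The function name <code>newton_root</code>, the tolerance, the iteration cap and the two example functions are illustrative assumptions, not part of the lecture.<br />

```python
def newton_root(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Find x with f(x) = 0 by Newton-Raphson, starting from x0."""
    x_old = x0
    for _ in range(max_iter):
        x_new = x_old - f(x_old) / f_prime(x_old)   # Newton-Raphson update
        if abs(x_new - x_old) < tol:                # stop when the step is tiny
            return x_new
        x_old = x_new
    return x_old

# Root finding: solve x^2 - 2 = 0 starting from x = 1.
root = newton_root(lambda x: x**2 - 2.0, lambda x: 2.0 * x, x0=1.0)

# Optimization: minimizing g(x) = (x - 3)^2 + 1 means solving g'(x) = 2 (x - 3) = 0,
# so the roles of f and f' are played by g' and g''.
minimizer = newton_root(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
```

For a quadratic objective the second call converges in a single step, because the Newton update is exact when the derivative is linear.<br />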
<br />
<br /><br />
<br />
In vector notation the above can be written as <br /><br />
<br />
<math><br />
X^{new} \leftarrow X^{old} - H^{-1}\Delta<br />
</math><br />
<br /><br />
where H is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix] or second derivative matrix and <math>\Delta</math> is the gradient, both evaluated at <math>X^{old}</math>. <br />
<br /><br />
<br />
'''note:''' If the Hessian is not invertible, the [http://en.wikipedia.org/wiki/Generalized_inverse generalized inverse] or pseudo-inverse can be used.<br />
<br /><br />
<br /><br />
<br />
<br />
As shown above, the [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second derivative or Hessian.<br />
<br />
<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website containing a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the number of occurrences of <math>\underline{\beta}</math> to one with the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math><br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{(d+1)}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i,i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^{T}\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
<math>\ Z </math> is called the adjusted response. Because <math>\ \underline{P} </math>, <math>\ W </math>, and <math>\ Z </math> all depend on <math>\underline{\beta}^{old}</math>, they are recomputed at every iteration. This algorithm is called [http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares iteratively reweighted least squares] because it solves a weighted least squares problem repeatedly.<br />
<br />
Recall that linear regression by least squares solves <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math>,<br />
<br />
whose solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math>.<br />
<br />
Similarly, <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow arg \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
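The pseudo code above can be sketched in MATLAB as follows. The toy data set (one feature, eight non-separable points), the tolerance, and the iteration cap are assumptions for illustration; <code>X</code> is stored as a <math>(d+1)\times n</math> matrix whose first row is all ones, matching the notation used above.

```matlab
% Iteratively reweighted least squares on toy, non-separable data (assumed).
x = [-2 -1.5 -1 -0.5 0.5 1 1.5 2];   % one feature, n = 8 observations
Y = [0 0 1 0 1 0 1 1]';              % step 2: n x 1 vector of labels in {0,1}
n = numel(x);
X = [ones(1, n); x];                 % (d+1) x n input matrix, first row = intercept

beta = zeros(2, 1);                  % step 1: beta <- 0
for iter = 1:100
    P = 1 ./ (1 + exp(-(X' * beta)));        % step 3: same as exp(b'x)/(1+exp(b'x))
    w = P .* (1 - P);                        % step 4: diagonal entries of W
    W = diag(w);
    Z = X' * beta + (Y - P) ./ w;            % step 5: adjusted response
    beta_new = (X * W * X') \ (X * W * Z);   % step 6: weighted least squares solve
    converged = norm(beta_new - beta) < 1e-10;
    beta = beta_new;
    if converged
        break;                               % step 7: new beta close to old beta
    end
end

P = 1 ./ (1 + exp(-(X' * beta)));
score = X * (Y - P);     % gradient of the log-likelihood, close to zero at the optimum
```

The data are deliberately not linearly separable: with perfectly separable classes the maximum likelihood estimate does not exist and the iteration would not settle.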
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1. A closed-form solution exists for least squares.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>; it is guaranteed to lie between 0 and 1 and to sum to 1. No closed-form solution exists.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d+1</math> (the <math>\,d</math> coefficients plus the intercept). The number of parameters grows linearly with the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically with the dimension.<br />
#LDA estimates its parameters more efficiently by using more information about the data; samples without class labels can also be used in LDA. <br />
<br />
<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust. (For high-dimensional data, logistic regression is more accommodating.)<br />
#In practice, logistic regression and LDA often give similar results.<br />
#In particular, logistic regression does not assume that the independent variables are normally distributed, which contributes to its robustness.<br />
<br />
Many other advantages of logistic regression are explained [http://www.statgun.com/tutorials/logistic-regression.html here].<br />
<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data with logistic regression. This function returns B, a <math>\,(d+1)</math><math>\,\times</math><math>\,(k-1)</math> matrix of estimates in which each column holds the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
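The posterior formulas and the classification rule above can be evaluated directly from the fitted coefficients. The sketch below hard-codes the values of B printed above; the handle names (<code>eta</code>, <code>post1</code>, <code>post2</code>, <code>classify_lr</code>) and the two query points are hypothetical and only serve to illustrate the rule.

```matlab
% Coefficients as printed by mnrfit in the session above.
B = [0.1861; -5.5917; -3.0547];

% Linear predictor and the two posterior probabilities.
eta   = @(x1, x2) B(1) + B(2)*x1 + B(3)*x2;
post1 = @(x1, x2) exp(eta(x1, x2)) ./ (1 + exp(eta(x1, x2)));   % P(Y=1 | X=x)
post2 = @(x1, x2) 1 ./ (1 + exp(eta(x1, x2)));                  % P(Y=2 | X=x)

% Classification rule: class 1 if eta >= 0, class 2 otherwise.
classify_lr = @(x1, x2) 1 + (eta(x1, x2) < 0);

% Two hypothetical points in the PCA coordinates.
labelA = classify_lr(-0.5, -0.5);   % eta is positive here
labelB = classify_lr( 0.5,  0.5);   % eta is negative here
```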
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|The decision boundary found by logistic regression. The line shows how the two classes are split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
===Lecture Summary===<br />
<br />
Traditionally, logistic regression parameters are estimated by maximum likelihood, although other optimization techniques may be used as well.<br />
<br /><br />
Since there is no closed-form solution for the zero of the first derivative of the log-likelihood, the Newton-Raphson algorithm is used. Because the problem is convex, any stationary point that Newton's method converges to is a global optimum.<br />
<br /><br />
Logistic regression requires fewer parameters than LDA or QDA and is therefore more favorable for high-dimensional data.<br />
<br />
===Supplements===<br />
<br />
A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture.<br />
<br />
== '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' ==<br />
<br />
=== Lecture Summary ===<br />
<br />
In this lecture, the topic of logistic regression was concluded by covering multi-class logistic regression, and a new topic, the perceptron, was introduced. The perceptron is a linear classifier for two-class problems. Its main goal is to classify data into two classes by minimizing the distances between the misclassified points and the decision boundary. This will be continued in the following lectures.<br />
<br />
=== Multi-Class Logistic Regression ===<br />
Recall that in two-class logistic regression, the posterior probability of one of the classes (say class 1) is modeled by a function of the form shown in Figure 1. <br />
<br />
The posterior probability of the other class (class 0) is its complement: <br /><br /><br />
<math>\displaystyle P(Y=0 | X=x) = 1 - P(Y=1 | X=x)</math><br /><br />
<br />
The function shown in Figure 1 is the sigmoid (logistic) function, which is why this algorithm is called "logistic regression":<br />
[[File:Picture1.png|150px|thumb|right|<math>Fig.1: P(Y=1 | X=x)</math>]]<br />
<br />
<math>\displaystyle \sigma\,\!(a) = \frac {e^a}{1+e^a} = \frac {1}{1+e^{-a}}</math><br /><br /><br />
<br />
In two-class logistic regression, we compare the posterior of one class to the other one using this ratio:<br /><br />
<br />
:<math> \frac{P(Y=1|X=x)}{P(Y=0|X=x)}</math><br /><br />
<br />
If we look at the natural logarithm of this ratio, we find that it is always a linear function in <math>x</math>:<br /><br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\underline{\beta}^T\underline{x} \quad \rightarrow (*)</math> <br /><br /><br />
<br />
What if we have more than two classes?<br /><br />
<br />
Using (*), we can extend the notion of logistic regression for the cases where we have more than two classes.<br /><br />
<br />
Assume we have <math>k</math> classes. Looking at the logarithm of the ratio of posteriors of each class and the k<sup>th</sup> class, we have: <br /><br />
<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_1}^T\underline{x} </math> <br /><br />
:<math>\log\left(\frac{P(Y=2|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_2}^T\underline{x} </math> <br /><br />
::::<math> \vdots</math><br /><br />
:<math>\log\left(\frac{P(Y=k-1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_{k-1}}^T\underline{x} </math> <br /><br />
<br />
<br />
Although in the above posterior ratios, the denominator is chosen to be the posterior of the last class (class k), the choice of denominator is arbitrary in that the posterior estimates are equivariant under this choice - [http://www.springerlink.com/content/t45k620382733r71/ Linear Methods for Classification].<br /><br /><br />
<br />
Each of these functions is linear in <math>x</math>; however, each has a different coefficient vector <math>\underline{\,\beta_{i}}</math>. We also have to make sure that the posterior probabilities assigned to the different classes sum to one.<br /><br /><br />
<br />
In general, we can write:<br />
<br /><math>P(Y=c | X=x) = \frac{e^{\underline{\beta_c}^T \underline{x}}}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}},\quad c \in \{1,\dots,k-1\} </math><br /><br />
<br /><math>P(Y=k | X=x) = \frac{1}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}}</math><br /><br />
These posteriors clearly sum to one. <br /><br /><br />
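As a quick numerical check of these formulas, the MATLAB sketch below computes the <math>k</math> posteriors for one input. The coefficient matrix <code>Beta</code>, the input <code>x</code>, and the choice <math>k=3</math>, <math>d=2</math> are hypothetical values picked for illustration.

```matlab
% Hypothetical coefficients: k = 3 classes, d = 2 features plus an intercept.
Beta = [ 0.5 -1.0;      % (d+1) x (k-1) matrix; column c holds beta_c
         1.0  0.3;
        -2.0  0.8];
x = [1; 0.4; -1.2];     % input vector with a leading 1 for the intercept

expTerms  = exp(Beta' * x);          % (k-1) x 1 vector of exp(beta_c' * x)
denom     = 1 + sum(expTerms);       % shared denominator
posterior = [expTerms; 1] / denom;   % k x 1; the last entry is P(Y=k | X=x)

[~, yhat] = max(posterior);          % predicted class: the largest posterior
```

By construction the entries of <code>posterior</code> sum to one, and the log-ratio of any of the first <math>k-1</math> entries to the last one recovers the corresponding linear function <math>\underline{\beta_c}^T\underline{x}</math>.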
<br />
Note that logistic regression does not assume a distribution for the prior, whereas LDA assumes the prior to be Bernoulli.<br />
<br />
In the two-class case, it is fairly simple to find the parameter <math>\beta</math> (in two-class logistic regression, <math>\beta</math> has dimension <math>(d+1)\times1</math>), as mentioned in previous lectures. In the multi-class case the iterative Newton method can still be used, but here <math>\beta</math> is of size <math>(d+1)\times(k-1)</math> and the weight matrix W is dense and non-diagonal. The resulting algorithm is computationally less efficient, but it is still feasible. A trick is to re-parametrize the logistic regression problem by expanding the input vector <math>x</math> (Question 4 in Assignment 2).<br />
<br /><br /><br />
<br />
===Neural Network Concept===<br />
The concept of an artificial neural network comes from scientists who wanted to simulate the human neural network on computers, with the aim of creating programs that can learn like people. A neural network is a method in artificial intelligence that uses a simplified model of neural processing in the brain, although the relation between this model and the biological architecture of the brain is not yet clear.<br />
<br />
=== Perceptron ===<br />
<br />
The [http://en.wikipedia.org/wiki/Perceptron perceptron], invented in 1957 by [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt], is the basic building block of feedforward neural networks.<br /><br /><br />
<br />
We know that the least squares fit obtained by regressing a -1/1 response variable <math>\displaystyle Y</math> on the observations <math>\displaystyle x</math> leads to the same coefficients as LDA (up to a scalar multiple). Recall that LDA minimizes the distance between the discriminant function (decision boundary) and the data points. Least squares returns the sign of the linear combination of features as the class label (Figure 2). This was called the perceptron in the engineering literature during the 1950s. <br /><br /><br />
<br />
[[File:Perceptron.jpg|371px|thumb|right| Fig.2 Diagram of a linear perceptron ]]<br />
<br />
There is a cost function <math>\displaystyle D</math> that perceptron tries to minimize:<br /><br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math><br /><br />
<br />
where <math>\displaystyle M</math> is a set of misclassified points. <br /><br />
<br />
This is basically minimizing the sum of distances between the misclassified points and the decision boundary.<br /><br /><br />
<br />
'''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br /><br />
<br />
Consider points <math>\underline{x_1}</math>, <math>\underline{x_2}</math> and a decision boundary defined by <math>\underline{\beta}^T\underline{x}+\beta_0=0</math>, as shown in Figure 3.<br /><br />
<br />
[[File:DB.jpg|248px|thumb|right| Fig.3 Distance from the decision boundary ]]<br />
<br />
If both <math>\underline{x_1}</math> and <math>\underline{x_2}</math> lie on the decision boundary, then we have:<br /><br />
<math>\underline{\beta}^T\underline{x_1}+\beta_0=0 \rightarrow (1)</math><br /><br />
<math>\underline{\beta}^T\underline{x_2}+\beta_0=0 \rightarrow (2)</math><br /><br />
<br />
From (1) and (2):<br /><br />
<math>\underline{\beta}^T(\underline{x_2}-\underline{x_1})=0</math><br /><br />
<br />
Therefore, <math>\displaystyle \underline{\beta}</math> is orthogonal to <math>\underline{x_2}-\underline{x_1}</math>, which lies along the decision boundary; that is, <math>\displaystyle \underline{\beta}</math> is orthogonal to the decision boundary. <br /><br />
<br />
Then the signed distance of a point <math>\underline{x_0}</math> from the decision boundary is, up to the constant factor <math>1/\|\underline{\beta}\|</math> (which equals 1 when <math>\underline{\beta}</math> has unit length): <br /><br />
<br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})</math><br /><br />
<br />
From (2): <br /><br />
<br />
<math>\underline{\beta}^T\underline{x_2}= -\beta_0</math>. <br /><br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})=\underline{\beta}^T\underline{x_0}-\underline{\beta}^T\underline{x_2}=\underline{\beta}^T\underline{x_0}+\beta_0</math><br /><br />
<br />
Therefore, the signed distance from any point <math>\underline{x_{i}}</math> to the discriminant hyperplane is given, up to the factor <math>1/\|\underline{\beta}\|</math>, by <math>\underline{\beta}^T\underline{x_{i}}+\beta_0</math>.<br /><br /><br />
<br />
However, this quantity is not always positive. Consider <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>: if <math>\underline{x_{i}}</math> is classified ''correctly'', then this product is positive, since <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are either both positive or both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other one is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> is negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> therefore makes this cost function non-negative. <br /><br /><br />
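The sign argument can be checked numerically. In the MATLAB sketch below, the five data points, the labels, and the values of <math>\underline{\beta}</math> and <math>\beta_0</math> are hypothetical; points lying exactly on the boundary are counted as misclassified, which is an assumption of this sketch.

```matlab
% Toy data (assumed): columns of X are the points, y holds labels in {-1,+1}.
X = [ 2  1 -1 -2  0.5;
      1  2 -1 -1 -1.0];
y = [ 1  1 -1 -1  1  ];
beta  = [1; 1];          % hypothetical weight vector
beta0 = 0;               % hypothetical offset

margins = y .* (beta' * X + beta0);   % y_i * (beta' * x_i + beta0) for every point
M = find(margins <= 0);               % indices of the misclassified points
D = -sum(margins(M));                 % perceptron cost; never negative
```

Only the fifth point has a negative product here, so it is the only member of <code>M</code> and the only contributor to <code>D</code>.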
<br />
==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 ==<br />
===Lecture Summary===<br />
In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on gradient descent. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network (FFNN). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation. <br />
<br />
To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.<br />
<br />
===Perceptron===<br />
The last lecture introduced the Perceptron and showed how it can suggest a solution for the 2-class classification problem. We saw that the solution requires minimization of a cost function, which is basically a summation of the distances of the misclassified data points to the separating hyperplane. This cost function is<br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x}_i+\beta_0),</math><br />
<br />
in which, <math>\,M</math> is the set of misclassified points. Thus, the objective is to find <math>\arg\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>.<br />
<br />
====Perceptron Learning Algorithm====<br />
To minimize <math>D(\underline{\beta},\beta_0)</math>, an algorithm that uses gradient-descent has been suggested. Gradient descent, also known as steepest descent, is a numerical optimization technique that starts from an initial value for <math>(\underline{\beta},\beta_0)</math> and recursively approaches an optimal solution. Each step of recursion updates <math>(\underline{\beta},\beta_0)</math> by subtracting from it a factor of the gradient of <math>D(\underline{\beta},\beta_0)</math>. Mathematically, this gradient is<br />
<br />
<math>\nabla D(\underline{\beta},\beta_0)<br />
= \left( \begin{array}{c}\cfrac{\partial D}{\partial \underline{\beta}} \\ \\ <br />
\cfrac{\partial D}{\partial \beta_0} \end{array} \right)<br />
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}\underline{x}_i \\ <br />
-\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math><br />
<br />
However, the perceptron learning algorithm does not use the sum of the contributions from all observations to calculate the gradient at each step. Instead, each step uses the gradient contribution from only a single observation, and each successive step uses a different observation. This slight modification is called stochastic gradient descent. That is, instead of subtracting some factor of <math>\nabla D(\underline{\beta},\beta_0)</math> at each step, we subtract a factor of the single-observation gradient (equivalently, we add a factor of <math>y_i\underline{x}_i</math> and <math>y_i</math>):<br />
<br />
<math>-\left( \begin{array}{c} y_{i}\underline{x}_i \\ <br />
y_{i} \end{array} \right)</math><br />
<br />
As a result, the pseudo code for the Perceptron Learning Algorithm is as follows:<br />
<br />
:1) Choose a random initial value for <math>(\underline{\beta},\beta_0)</math>.<br />
<br />
:2) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^0\\<br />
\beta_0^0<br />
\end{pmatrix}</math><br />
<br />
:3) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{new}}\\<br />
\beta_0^{\mathrm{new}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix}<br />
y_i \underline{x_i}\\<br />
y_i<br />
\end{pmatrix}</math> for some <math>\,i \in M</math>.<br />
<br />
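:4) If the termination criterion has not been met, set <math>(\underline{\beta}^{\mathrm{old}},\beta_0^{\mathrm{old}}) \leftarrow (\underline{\beta}^{\mathrm{new}},\beta_0^{\mathrm{new}})</math>, go back to step 3, and use a different observation (i.e. a different <math>\,i</math>).<br />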
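The steps above can be sketched in MATLAB as follows. The linearly separable toy data, the learning rate <math>\rho=1</math>, and the epoch cap are assumptions for illustration. For reproducibility this sketch starts from zero instead of a random initial value, and it visits the observations in a fixed cyclic order, which is one simple way of using a different observation at each step.

```matlab
% Linearly separable toy data (assumed): columns are points, labels in {-1,+1}.
X = [ 2  1  3    2.5 -1 -2 -1.5 -3;
      1  2  2.5  3   -1 -1 -2.5 -2];
y = [ 1  1  1    1   -1 -1 -1   -1];
n = size(X, 2);

beta  = zeros(2, 1);     % steps 1-2: initial value (zero here, not random)
beta0 = 0;
rho   = 1;               % learning rate (assumed)
maxEpochs = 100;         % safety cap on passes through the data

for epoch = 1:maxEpochs
    numUpdates = 0;
    for i = 1:n
        % Observation i is in M if y_i * (beta' * x_i + beta0) <= 0.
        if y(i) * (beta' * X(:, i) + beta0) <= 0
            beta  = beta  + rho * y(i) * X(:, i);   % step 3, weight part
            beta0 = beta0 + rho * y(i);             % step 3, offset part
            numUpdates = numUpdates + 1;
        end
    end
    if numUpdates == 0
        break;           % step 4: M is empty, so terminate
    end
end

numMisclassified = sum(y .* (beta' * X + beta0) <= 0);
```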
<br />
The learning rate <math>\,\rho</math> controls the step size of convergence toward <math>\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>. A larger value for <math>\,\rho</math> causes the steps to be larger. If <math>\,\rho</math> is set to be too large, however, then the minimum could be missed (over-stepped).<br />
In practice, <math>\rho</math> can be adaptive rather than fixed: it can be larger in the first steps than in the last ones. At the beginning, a larger <math>\rho</math> helps to find an approximate answer sooner, and a smaller <math>\rho</math> in the last steps helps to tune the final answer more accurately. <br />
<br />
<br />
As mentioned earlier, the learning algorithm uses just one of the data points at each iteration; this is the common practice when dealing with online applications. In an online application, datapoints are accessed one-at-a-time because training data is not available in batch form. The learning algorithm does not require the derivative of the cost function with respect to the previously seen points; instead, we just have to take into consideration the effect of each new point.<br />
<br />
One way that the algorithm could terminate is if there are no more misclassified points (i.e. if the set <math>\,M</math> is empty). As long as there are points in <math>\,M</math>, the algorithm continues until some other termination criterion is reached. The termination criterion for an optimization algorithm is usually convergence, but for numerical methods this is not well-defined. In theory, convergence is reached when the gradient of the cost function is zero; in numerical methods, a gradient that is close to zero within some margin of error is accepted instead.<br />
<br />
If the data are linearly separable, the algorithm is theoretically guaranteed to converge in a finite number of iterations. This number of iterations depends on the <br />
<br />
* learning rate <math>\,\rho</math><br />
<br />
* initial value <math>(\underline{\beta},\beta_0)</math><br />
<br />
* difficulty of the problem. The problem is more difficult if the gap between the classes of data is very small.<br />
<br />
Note that we consider the offset term <math>\beta_0</math> separately from the <math>\underline{\beta}</math> to distinguish this formulation from those in which the direction of the hyperplane (<math>\underline{\beta}</math>) has been considered.<br />
<br />
A major concern about gradient descent is that it may get trapped in local optimal solutions.<br />
<br />
====Some notes on the Perceptron Learning Algorithm====<br />
<br />
* If the training data points are available in batch form, it is better to take advantage of a batch technique such as least squares (which has a closed-form solution) or maximum-likelihood estimation for linear classifiers. (These techniques had been around for many years before the invention of the Perceptron.)<br />
<br />
* Just like the linear classifier, a Perceptron can discriminate between only two classes at a time, and one can generalize its performance for multi-class problems by using one of the <math>k-1</math>, <math>k</math>, or <math>k(k-1)/2</math>-hyperplane methods.<br />
<br />
* If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane, which makes the error of training data zero. The convergence is guaranteed if the learning rate is set adequately.<br />
<br />
* If the two classes are not linearly separable, the algorithm will never converge. So, one may think of a termination criterion in these cases. (e.g. a maximum number of iterations in which convergence is expected, or the rate of changes in both a cost function and its derivative).<br />
<br />
* In the case of linearly separable classes, the final solution and number of iterations will be dependent on the initial conditions, learning rate, and the gap between the two classes. In general, a smaller gap between classes requires a greater number of iterations for the algorithm to converge.<br />
<br />
* Learning rate --or updating step-- has a direct impact on both number of iterations and the accuracy of the solution for the optimization problem. Smaller quantities for this factor make convergence slower, even though we will end up with a more accurate solution. In the opposite way, larger values for learning rate make the process faster, even though we may lose some precision. So, one may make a balance for this trade-off in order to get fast enough to an accurate enough solution. (exploration vs. exploitation)<br />
<br />
In the upcoming lectures, we introduce Support Vector Machines (SVM), which use an iterative optimization scheme similar to that of the Perceptron but have a different definition of the cost function.<br />
<br />
===Universal Function Approximator===<br />
The universal function approximator is a mathematical formulation for a group of estimation techniques. The usual formulation for it is<br />
<br />
<math>\hat{Y}(x)=\sum\limits_{i=1}^{n}\alpha_i\sigma(\omega_i^Tx+b_i),</math><br />
<br />
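where <math>\hat{Y}(x)</math> is an estimate of a function <math>\,Y(x)</math>. According to the universal approximation theorem, for a continuous <math>\,Y(x)</math> on a compact set and any <math>\,\epsilon>0</math>, there exist <math>\,n</math> and parameters <math>\,\alpha_i</math>, <math>\,\omega_i</math>, <math>\,b_i</math> such that<br />
<br />
<math>|\hat{Y}(x) - Y(x)|<\epsilon,</math><br />
<br />
which means that <math>\hat{Y}(x)</math> can get as close to <math>\,Y(x)</math> as necessary.<br />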
<br />
This formulation assumes that the output, <math>\,Y(x)</math>, is a linear combination of a set of functions of the form <math>\,\sigma(.)</math>, where <math>\,\sigma(.)</math> is a nonlinear function applied to a linear combination of the inputs <math>\,x_i</math>.<br />
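To make the formulation concrete, the MATLAB sketch below evaluates <math>\hat{Y}(x)</math> for a scalar input with <math>n=2</math> terms. The logistic choice of <math>\,\sigma</math> and all parameter values are assumptions for illustration: the two terms are chosen so that their difference forms a "bump" that is close to 1 for <math>|x|<0.5</math> and close to 0 far outside that interval.

```matlab
% Logistic sigmoid as the nonlinear function sigma (an assumed choice).
sigma = @(a) 1 ./ (1 + exp(-a));

% Hypothetical parameters for n = 2 terms and a scalar input x.
alpha = [ 1; -1];        % output weights alpha_i
omega = [20; 20];        % input weights omega_i
b     = [10; -10];       % offsets b_i

% Yhat(x) = sum_i alpha_i * sigma(omega_i * x + b_i), for scalar x.
Yhat = @(x) sum(alpha .* sigma(omega * x + b));

yCenter = Yhat(0);       % inside the bump
yFar    = Yhat(2);       % far outside the bump
```

Adding more such terms with different offsets lets the sum approximate more complicated functions, which is the intuition behind the approximation result above.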
<br />
====Generalization Factors====<br />
Even though this formulation represents a universal function approximator, which means that it can be fitted to a set of data as closely as demanded, the closeness of fit must be carefully decided upon. In many cases, the purpose of the model is to target unseen data. However, the fit to this unseen data is impossible to determine before it arrives.<br />
<br />
To overcome this dilemma, a common practice is to divide the available data points into two sets: training data and validation data. We use the training data to estimate the fixed parameters of the model, and then use the validation data to find values for the construction-dependent parameters. What these construction-dependent parameters are depends on the model. In the case of a polynomial, the construction-dependent parameter would be its highest degree, and for a neural network, the construction-dependent parameters could be the number of hidden layers and the number of neurons in each layer.<br />
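For the polynomial case, this procedure can be sketched in MATLAB as follows. The synthetic data (a cubic plus a small deterministic perturbation standing in for noise), the alternating train/validation split, and the maximum degree are all assumptions for illustration.

```matlab
% Synthetic data (assumed): a cubic plus a small deterministic perturbation.
x = linspace(-1, 1, 40)';
y = x.^3 - x + 0.02 * sin(50 * x);

trainIdx = 1:2:40;       % odd-numbered points are used for training
valIdx   = 2:2:40;       % even-numbered points are used for validation

maxDegree = 4;           % largest polynomial degree considered (assumed)
valErr = zeros(maxDegree, 1);
for deg = 1:maxDegree
    coeffs = polyfit(x(trainIdx), y(trainIdx), deg);   % fit on training data
    resid  = y(valIdx) - polyval(coeffs, x(valIdx));   % errors on validation data
    valErr(deg) = mean(resid.^2);                      % validation mean squared error
end

[~, bestDegree] = min(valErr);   % degree with the smallest validation error
```

The coefficients are the fixed parameters estimated from the training data, while the degree is the construction-dependent parameter selected on the validation data.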
<br />
These issues of model generalization vs. complexity will be discussed in more detail in the lectures to follow.<br />
<br />
===Feed-Forward Neural Network===<br />
The Neural Network (NN) is one application of the universal function approximator. It can be thought of as a system of Perceptrons linked together as units of a network. One particular NN useful for classification is the Feed-Forward Neural Network (FFNN), which consists of multiple "hidden layers" of Perceptron units. Our discussion here is based on the FFNN, which has the topology shown in Figure 1. The units of the first hidden layer receive input from the original features. Between the hidden layers, connections from each unit are always directed to units in the next adjacent layer. In the output layer, which receives input only from the last hidden layer, each unit produces a target measurement for a distinct class (i.e. <math>\,K</math> classes require <math>\,K</math> units). In Figure 1, the units in a single layer are distributed vertically, and the inputs and outputs of the network are shown as the far left and right layers respectively.<br />
<br />
[[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]]<br />
<br />
====Mathematical Model of the FFNN with One Hidden Layer====<br />
The FFNN with one hidden layer for a <math>\,K</math>-class problem is defined as follows. Let <math>\,d</math> be the number of input features, <math>\,p</math> be the number of units in the hidden layer, and <math>\,K</math> be the number of classes (i.e. the number of units in the output layer).<br />
<br />
Each neural unit calculates its derived feature (i.e. output) using a linear combination of its inputs. Suppose <math>\,\underline{x}</math> is the <math>\,d</math>-dimensional vector of input features. Then, each neural unit uses a <math>\,d</math>-dimensional vector of weights to combine these input features: for the <math>\,i</math>th neural unit, let <math>\underline{u}_i</math> be this vector of weights. The linear combination calculated by the <math>\,i</math>th unit is then given by<br />
<br />
<math>a_i = \underline{u}_i^T\underline{x}</math><br />
<br />
However, we want the derived feature to lie between 0 and 1, so we apply an ''activation function'' <math>\,\sigma(a)</math>. The derived feature for the <math>\,i</math>th unit is then given by<br />
<br />
<math>\,z_i = \sigma(a_i)</math> where <math>\,\sigma</math> is typically the logistic function<br />
<br />
<math>\sigma(a) = \cfrac{1}{1+e^{-a}}</math><br />
<br />
Now, we place each of the derived features <math>\,z_i</math> from the hidden layer into a <math>\,p</math>-dimensional vector:<br />
<br />
<math>\underline{z} = \left[ \begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_p \end{array}\right]</math><br />
<br />
As in the hidden layer, each unit in the output layer calculates its derived feature using a linear combination of its inputs. Each output unit uses a <math>\,p</math>-dimensional vector of weights to combine the input features derived from the hidden layer. Let <math>\,\underline{w}_k</math> be this vector of weights used in the <math>\,k</math>th unit. The linear combination calculated by the <math>\,k</math>th unit is then given by<br />
<br />
<math>\hat{y}_k = \underline{w}_k^T\underline{z}</math><br />
<br />
<math>\,\hat{y}_k</math> is thus the estimated target measurement for the <math>\,k</math>th class. Note that an activation function <math>\,\sigma</math> is not used here.<br />
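<br />
The model above can be sketched in a few lines of Python. This is only an illustrative forward pass under the definitions given so far; the function names and the particular weight values are hypothetical and were chosen so that the result is easy to check by hand.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, U, W):
    """One-hidden-layer FFNN: returns (a, z, y_hat) for a single input x."""
    a = [sum(u_il * x_l for u_il, x_l in zip(u_i, x)) for u_i in U]      # a_i = u_i^T x
    z = [logistic(a_i) for a_i in a]                                      # z_i = sigma(a_i)
    y_hat = [sum(w_ki * z_i for w_ki, z_i in zip(w_k, z)) for w_k in W]  # y_hat_k = w_k^T z
    return a, z, y_hat


# Hypothetical weights for d = 2 inputs, p = 2 hidden units and K = 2 output units.
x = [1.0, 2.0]
U = [[0.0, 0.0], [1.0, -0.5]]
W = [[1.0, 1.0], [2.0, -2.0]]
a, z, y_hat = forward(x, U, W)
print(a, z, y_hat)
```

With these weights both hidden units receive <math>\,a_i = 0</math>, so <math>\,z_1 = z_2 = 0.5</math>, and the two outputs are <math>\,1</math> and <math>\,0</math>.<br />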
<br />
Notice that in each of the hidden units, two operations take place:<br />
<br />
* a linear combination of the neuron's inputs is calculated using corresponding weights<br />
<br />
* a nonlinear operation on the linear combination is performed. <br />
<br />
These two calculations are shown in Figure 2. <br />
<br />
The nonlinear function <math>\,\sigma(.)</math> is called the activation function. Activation functions, like the logistic function shown earlier, are usually continuous and have a limited range. Another common activation function used in neural networks is <math>\,\tanh(x)</math> (Figure 3).<br />
<br />
[[File:neuron2.png|300px|thumb|right|Fig.2 A general construction for a single neuron]]<br />
[[File:actfcn.png|300px|thumb|right|Fig.3 <math>tanh</math> as activation function]]<br />
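<br />
Both activation functions have simple derivatives, which are needed later for gradient-based training. The sketch below evaluates the logistic function and <math>\,\tanh</math> together with their derivatives and compares the derivatives with a central finite difference; the helper names and the test points are assumptions made for illustration.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def logistic_prime(a):
    s = logistic(a)
    return s * (1.0 - s)              # sigma'(a) = sigma(a) * (1 - sigma(a))


def tanh_prime(a):
    return 1.0 - math.tanh(a) ** 2    # equivalently 1 / cosh(a)^2


def central_difference(f, a, h=1e-6):
    """Numerical derivative, used here only to check the formulas above."""
    return (f(a + h) - f(a - h)) / (2.0 * h)


points = (-2.0, 0.0, 0.5, 3.0)
for a in points:
    print(a, logistic(a), logistic_prime(a), math.tanh(a), tanh_prime(a))
```

The logistic function takes values in <math>\,(0,1)</math> and its derivative is largest at the origin, where it equals <math>\,1/4</math>; <math>\,\tanh</math> takes values in <math>\,(-1,1)</math> and its derivative at the origin is <math>\,1</math>.<br />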
<br />
The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression, and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, a threshold stage is necessary.<br />
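<br />
The notes do not specify the form of the threshold stage. One simple possibility, shown here purely as an assumption, is to assign the class whose output unit gives the largest target measurement, or, for two classes coded as 0 and 1 with a single output unit, to threshold that output at 0.5.<br />

```python
def classify(y_hat):
    """Return the index k of the output unit with the largest target measurement."""
    return max(range(len(y_hat)), key=lambda k: y_hat[k])


def classify_binary(y_hat, threshold=0.5):
    """Two-class case with a single output unit and labels coded as 0 and 1."""
    return 1 if y_hat > threshold else 0


print(classify([0.1, 0.7, 0.2]), classify_binary(0.3))
```

Both functions turn the continuous outputs of the network into a discrete label, which is the only change needed relative to the regression setting.<br />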
<br />
====Mathematical Model of the FFNN with Multiple Hidden Layers====<br />
In the FFNN model with a single hidden layer, the derived features were represented as elements of the vector <math>\underline{z}</math>, and the original features were represented as elements of the vector <math>\underline{x}</math>. In the FFNN model with more than one hidden layer, <math>\underline{z}</math> is processed by the second hidden layer in the same way that <math>\underline{x}</math> was processed by the first hidden layer. Perceptrons in the second layer each use their own combination of weights to calculate a new set of derived features. These new derived features are processed by the third hidden layer in a similar way, and the cycle repeats for each additional hidden layer. This progression of processing is depicted in Figure 4.<br />
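<br />
A minimal sketch of this layer-by-layer processing is given below, assuming logistic hidden units and a linear output layer as in the single-hidden-layer model. The function names and the example network are hypothetical.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer_output(weights, inputs):
    """Derived features of one hidden layer: sigma(u_i^T inputs) for each unit i."""
    return [logistic(sum(u * v for u, v in zip(u_i, inputs))) for u_i in weights]


def forward_deep(x, hidden_layers, W_out):
    z = x
    for weights in hidden_layers:   # each layer processes the previous layer's derived features
        z = layer_output(weights, z)
    # Output layer: linear combination only, no activation function.
    return [sum(w * v for w, v in zip(w_k, z)) for w_k in W_out]


# Hypothetical network: 2 inputs -> 3 units -> 2 units -> 1 output, all hidden weights zero.
hidden_layers = [[[0.0, 0.0]] * 3, [[0.0, 0.0, 0.0]] * 2]
W_out = [[1.0, 2.0]]
print(forward_deep([0.3, -1.2], hidden_layers, W_out))
```

With all hidden weights equal to zero every hidden unit outputs 0.5, so the single output is <math>\,1.0 \cdot 0.5 + 2.0 \cdot 0.5 = 1.5</math> regardless of the input.<br />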
<br />
====Back-Propagation Learning Algorithm====<br />
<br />
[[File:bpl.png|300px|thumb|right|Fig.4 Labels for weights and derived features in the FFNN.]]<br />
<br />
Every linear-combination calculation in the FFNN involves weights that need to be set, and these weights are set using training data and an algorithm called Back-Propagation. This algorithm is similar to the gradient-descent algorithm introduced in the discussion of the Perceptron. The primary difference is that the gradient used in Back-Propagation is calculated in a more complicated way.<br />
<br />
First of all, we want to minimize the error between the estimated and true target measurements for the training data. That is, if <math>\,U</math> is the set of all weights in the FFNN, then we want to determine<br />
<br />
<math>\arg\min_U \left|y - \hat{y}\right|^2</math><br />
<br />
Now, suppose the hidden layers of the FFNN are labelled as in Figure 4. Then, we want to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the hidden layers of the FFNN. For weights <math>\,u_{jl}</math> this means we will need to find<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}}<br />
= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot<br />
\cfrac{\partial a_j}{\partial u_{jl}} = \delta_{j}z_l<br />
</math><br />
<br />
However, the closed-form solution for <math>\,\delta_{j}</math> is unknown, so we develop a recursive definition (<math>\,\delta_{j}</math> in terms of <math>\,\delta_{i}</math>):<br />
<br />
<math><br />
\delta_j = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j} <br />
= \sum_{i=1}^p \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot<br />
\cfrac{\partial a_i}{\partial a_j} <br />
= \sum_{i=1}^p \delta_i\cdot u_{ij} \cdot \sigma'(a_j)<br />
= \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}<br />
</math><br />
<br />
We also need to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the ''output layer'' <math>\,k</math> of the FFNN (this layer is not shown in Figure 4, but it would be the next layer to the right of the rightmost layer shown). For weights <math>\,u_{ki}</math> this means<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}}<br />
= \cfrac{\partial \left|y - \sum_i u_{ki}z_i\right|^2}{\partial u_{ki}}<br />
= -2(y - \sum_i u_{ki}z_i)z_i<br />
= -2(y - \hat{y})z_i<br />
</math><br />
<br />
By analogy with our computation of <math>\,\delta_j</math>, we define<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}</math><br />
<br />
However, <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}}<br />
= -2(y - \hat{y})</math><br />
<br />
Now that we have <math>\,\delta_k</math> and a recursive definition for <math>\,\delta_j</math>, it is clear that the derivatives with respect to all weights can be computed by starting from the output layer and working back through the hidden layers toward the input layer.<br />
<br />
Based on the above derivation, our algorithm for determining weights in the FFNN is as follows<br />
<br />
:1) Choose random initial weights.<br />
<br />
:2) Apply a new datapoint <math>\underline{x}</math> to the FFNN as the input layer, and calculate the values for all units.<br />
<br />
:3) Compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math>.<br />
<br />
:4) Back-propagate layer-by-layer by computing <math>\delta_j = \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}</math> for all units.<br />
<br />
:5) Compute <math>\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} = \delta_{j}z_l</math> for all weights <math>\,u_{jl}</math>.<br />
<br />
:6) Update <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}}<br />
- \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} </math> for all weights <math>\,u_{jl}</math> where <math>\,\rho</math> is the learning rate.<br />
<br />
:7) If the termination criterion has not been met, go back to step 2 and apply another datapoint (one full pass through all of the training datapoints is called an "epoch").<br />
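<br />
The steps above can be sketched in Python for a network with one logistic hidden layer and a linear output layer. This is an illustrative sketch rather than the course's implementation: the names <code>U</code> (hidden-layer weights) and <code>W</code> (output-layer weights) follow the single-hidden-layer model, and the datapoint, the initial weights and the learning rate are hypothetical.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, U, W):
    a = [sum(u * v for u, v in zip(u_j, x)) for u_j in U]
    z = [logistic(a_j) for a_j in a]
    y_hat = [sum(w * v for w, v in zip(w_k, z)) for w_k in W]
    return z, y_hat


def squared_error(x, y, U, W):
    _, y_hat = forward(x, U, W)
    return sum((y_k - yh_k) ** 2 for y_k, yh_k in zip(y, y_hat))


def gradients(x, y, U, W):
    z, y_hat = forward(x, U, W)                                      # step 2
    delta_out = [-2.0 * (y_k - yh_k) for y_k, yh_k in zip(y, y_hat)]  # step 3
    # Step 4: for the logistic function, sigma'(a_j) = z_j * (1 - z_j).
    delta_hid = [z_j * (1.0 - z_j) * sum(delta_out[k] * W[k][j] for k in range(len(W)))
                 for j, z_j in enumerate(z)]
    # Step 5: each derivative is (delta of the receiving unit) * (input on that connection).
    grad_W = [[d_k * z_j for z_j in z] for d_k in delta_out]
    grad_U = [[d_j * x_l for x_l in x] for d_j in delta_hid]
    return grad_U, grad_W


def sgd_step(x, y, U, W, rho):
    grad_U, grad_W = gradients(x, y, U, W)
    # Step 6: move every weight against its derivative, scaled by the learning rate rho.
    new_U = [[u - rho * g for u, g in zip(u_row, g_row)] for u_row, g_row in zip(U, grad_U)]
    new_W = [[w - rho * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, grad_W)]
    return new_U, new_W


# Hypothetical datapoint and initial weights (step 1 would normally draw these at random).
x, y = [1.0, -1.0], [1.0]
U = [[0.1, -0.2], [0.3, 0.4]]
W = [[0.5, -0.5]]
grad_U, grad_W = gradients(x, y, U, W)
error_before = squared_error(x, y, U, W)
U_new, W_new = sgd_step(x, y, U, W, rho=0.1)
error_after = squared_error(x, y, U_new, W_new)
print(error_before, error_after)
```

A useful sanity check for any back-propagation code is to compare a few of the computed derivatives with finite differences of the error, since a mistake in the recursion for <math>\,\delta_j</math> shows up immediately as a mismatch.<br />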
<br />
====Alternative Description of the Back-Propagation Algorithm====<br />
Label the inputs and outputs of the <math>\,i</math>th hidden layer <math>\underline{x}_i</math> and <math>\underline{y}_i</math> respectively, and let <math>\,\sigma(.)</math> be the activation function for all neurons. We now have<br />
<br />
<math>\begin{align}<br />
\begin{cases}<br />
\underline{y}_1=\sigma(W_1.\underline{x}_1),\\<br />
\underline{y}_2=\sigma(W_2.\underline{x}_2),\\<br />
\underline{y}_3=\sigma(W_3.\underline{x}_3),<br />
\end{cases}<br />
\end{align}</math><br />
<br />
where <math>\,W_i</math> is the matrix of connection weights between layers <math>\,i</math> and <math>\,i+1</math>; it has <math>\,n_i</math> columns and <math>\,n_{i+1}</math> rows, where <math>\,n_i</math> is the number of neurons in the <math>\,i^{th}</math> layer.<br />
<br />
Considering these matrix equations, one can write a closed form for the derivative of the error with respect to the weights of the network. For a neural network with two hidden layers, the equations are as follows.<br />
<br />
<math>\begin{align}<br />
\frac{\partial E}{\partial W_3}=&diag(e).\sigma'(W_3.\underline{x}_3).(\underline{x}_3)^T,\\<br />
\frac{\partial E}{\partial W_2}=&\sigma'(W_2.\underline{x}_2).(\underline{x}_2)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3\}\},\\<br />
\frac{\partial E}{\partial W_1}=&\sigma'(W_1.\underline{x}_1).(\underline{x}_1)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3.diag(\sigma'(W_2.\underline{x}_2)).W_2\}\},<br />
\end{align}</math><br />
<br />
where <math>\,\sigma'(.)</math> is the derivative of the activation function <math>\,\sigma(.)</math>.<br />
<br />
Using this closed-form derivative, it is possible to code the procedure for any number of layers and neurons. Here is Matlab code for the backpropagation algorithm. (<math>\,tanh</math> is utilized as the activation function.)<br />
<br />
<br />
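 % Fragment: the following are assumed to be defined beforehand (roles inferred from usage):<br />
 %   ep = number of epochs, i = 0, Q = number of datapoints (columns of data),<br />
 %   n = vector of layer sizes, l = number of weight layers, f = number of features,<br />
 %   dd = number of rows of data (features followed by targets), ro = learning rate,<br />
 %   w = weight array, E = zeros(1,ep), and a user-supplied shuffle(data,2) that permutes columns.<br />
 % Note: tanh is applied at every layer here, including the output layer.<br />
 while i < ep<br />
     i = i + 1;<br />
     data = shuffle(data,2);   % new random order of the datapoints in each epoch<br />
     for j = 1:Q               % one weight update per datapoint<br />
         io = zeros(max(n)+1,length(n));   % io(:,k): outputs of layer k, first entry is the bias 1<br />
         gp = io;                          % gp(:,k): tanh derivative for the units fed by weight layer k<br />
         io(1:n(1)+1,1) = [1;data(1:f,j)];<br />
         % forward pass<br />
         for k = 1:l<br />
             io(2:n(k+1)+1,k+1) = w(2:n(k+1)+1,1:n(k)+1,k)*io(1:n(k)+1,k);<br />
             gp(1:n(k+1)+1,k) = [0;1./(cosh(io(2:n(k+1)+1,k+1))).^2];<br />
             io(1:n(k+1)+1,k+1) = [1;tanh(io(2:n(k+1)+1,k+1))];<br />
             wg(1:n(k+1)+1,1:n(k)+1,k) = diag(gp(1:n(k+1)+1,k))*w(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
         % output error (a 0 is placed in the bias position)<br />
         e = [0;io(2:n(l+1)+1,l+1) - data(f+1:dd,j)];<br />
         wg(1:n(l+1)+1,1:n(l)+1,l) = diag(e)*wg(1:n(l+1)+1,1:n(l)+1,l);<br />
         gp(1:n(l+1)+1,l) = diag(e)*gp(1:n(l+1)+1,l);<br />
         d = eye(n(l+1)+1);<br />
         E(i) = E(i) + 0.5*norm(e)^2;      % accumulate the error of epoch i<br />
         % backward pass: update the weights from the last layer to the first<br />
         for k = l:-1:1<br />
             w(1:n(k+1)+1,1:n(k)+1,k) = w(1:n(k+1)+1,1:n(k)+1,k) - ro*diag(sum(d,1))*gp(1:n(k+1)+1,k)*(io(1:n(k)+1,k)');<br />
             d = d*wg(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
     end<br />
 end<br />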
<br />
====Some notes on the neural network and its learning algorithm====<br />
<br />
* The activation functions are usually approximately linear around the origin. If this is the case, choosing random initial weights between <math>\,-0.5</math> and <math>\,0.5</math> and normalizing the data may speed up the algorithm in the very first steps of the procedure, as the linear combination of the inputs and weights then falls within the linear region of the activation function.<br />
<br />
* Learning of the neural network using the backpropagation algorithm takes place in epochs. An epoch is a single pass through the entire training set.<br />
<br />
* It is common practice to randomly permute the training data in each epoch, to make the learning independent of the order of the data.<br />
<br />
* Given a set of data for training a neural network, one should set aside a portion of it as the validation dataset, in order to choose a suitable number of layers and number of neurons in each of the layers. The best construction may be the one which leads to the least error on the validation dataset. Validation data must not be used for training the network.<br />
<br />
* We can also use the validation-training scheme to estimate how many epochs are enough for training the network.<br />
<br />
* It is also common to use other optimization algorithms, such as steepest descent and conjugate gradient, in a batch form.<br />
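<br />
Several of the notes above (epochs, reshuffling in every epoch, a held-out validation set, and using validation error to decide how many epochs are enough) can be combined in one training loop. In the sketch below a single linear unit stands in for the network to keep the code short; the data, the learning rate, the 3-point validation set and the <code>patience</code> stopping rule are all assumptions made for illustration.<br />

```python
import random


def validation_error(w, data):
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
# Hypothetical noise-free data from y = 2x; a single linear unit stands in for the network.
data = [(i / 10, 2 * i / 10) for i in range(1, 11)]
rng.shuffle(data)
validation, training = data[:3], data[3:]   # validation points are never used for updates

w, rho, max_epochs, patience = 0.0, 0.1, 50, 3
initial_error = validation_error(w, validation)
history = []                                # validation error after each epoch
best_w, best_error, best_epoch = w, initial_error, 0
for epoch in range(1, max_epochs + 1):
    rng.shuffle(training)                   # new permutation of the training data each epoch
    for x, y in training:                   # one epoch = one pass over the training set
        w -= rho * (-2.0 * (y - w * x) * x)  # gradient step on the squared error (y - w x)^2
    history.append(validation_error(w, validation))
    if history[-1] < best_error:
        best_w, best_error, best_epoch = w, history[-1], epoch
    elif epoch - best_epoch >= patience:    # no improvement for `patience` epochs: stop
        break
print(best_epoch, best_error, best_w)
```

Here <code>best_epoch</code> plays the role of the number of epochs selected by the validation data, and <code>best_w</code> is the weight that would be kept.<br />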
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem, where it is difficult to estimate the errors. So in practice, configuring a Neural Network with Back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. Training algorithms for Deep Neural Networks deal with the training of a Neural Network with a large number of layers.<br />
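<br />
The shrinking of <math>\,\delta</math> can be illustrated with a small sketch. Assume a chain of single-unit logistic layers in which every connection has the same weight; since the derivative of the logistic function never exceeds <math>\,1/4</math>, each application of the recursion for <math>\,\delta_j</math> multiplies the error signal by at most a quarter of the weight. The chain architecture, the unit weight and the unit error signal at the output are assumptions chosen only to make the effect visible.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def first_layer_delta(depth, weight=1.0, x=1.0):
    """|delta| at the first layer of a chain of `depth` single-unit logistic layers."""
    # Forward pass: store the output z of every layer.
    outputs, z = [], x
    for _ in range(depth):
        z = logistic(weight * z)
        outputs.append(z)
    # Backward pass: delta_j = sigma'(a_j) * weight * delta_next, starting from a unit error.
    delta = 1.0
    for z in reversed(outputs):
        delta = z * (1.0 - z) * weight * delta
    return abs(delta)


for depth in (1, 2, 5, 10, 20):
    print(depth, first_layer_delta(depth))
```

With unit weights the value at the first layer is bounded by <math>\,(1/4)^{L}</math> for <math>\,L</math> layers, so it decays geometrically with depth.<br />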
<br />
The approach to training a deep network is to first assume that the network has only two layers and train these two layers. After that, we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize the energy function, which is inspired by the theory in atomic physics concerning the most stable condition; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.<br />
<br />
==== Difficulties of training deep architecture <ref>{{Cite journal | title = Exploring Strategies for Training Deep Neural Networks | url = http://jmlr.csail.mit.edu/papers/volume10/larochelle09a/larochelle09a.pdf | year = 2009 | journal = Journal of Machine Learning Research | page = 1-40 | volume = 10 | last1 = Larochelle | first1 = H. | last2 = Bengio | first2 = Y. | last3 = Louradour | first3 = J. | last4 = Lamblin | first4 = P. }}</ref> ====<br />
<br />
Given a particular task, a natural way to train a deep network is to frame it as an optimization problem by specifying a supervised cost function on the output layer with respect to the desired target and use a gradient-based optimization algorithm in order to adjust the weights and biases of the network so that its output has low cost on samples in the training set. Unfortunately, deep networks trained in that manner have generally been found to perform worse than neural networks with one or two hidden layers.<br />
<br />
We discuss two hypotheses that may explain this difficulty. The first one is that gradient descent can easily get stuck in poor local minima (Auer et al., 1996) or plateaus of the non-convex training criterion. The number and quality of these local minima and plateaus (Fukumizu and Amari, 2000) clearly also influence the chances for random initialization to be in the basin of attraction (via gradient descent) of a poor solution. It may be that with more layers, the number or the width of such poor basins increases. To reduce the difficulty, it has been suggested to train a neural network in a constructive manner in order to divide the hard optimization problem into several greedy but simpler ones, either by adding one neuron (e.g., see Fahlman and Lebiere, 1990) or one layer (e.g., see Lengellé and Denoeux, 1996) at a time. These two approaches have demonstrated to be very effective for learning particularly complex functions, such as a very non-linear classification problem in 2 dimensions. However, these are exceptionally hard problems, and for learning tasks usually found in practice, this approach commonly overfits.<br />
<br />
This observation leads to a second hypothesis. For high capacity and highly flexible deep networks, there actually exists many basins of attraction in its parameter space (i.e., yielding different solutions with gradient descent) that can give low training error but that can have very different generalization errors. So even when gradient descent is able to find a (possibly local) good minimum in terms of training error, there are no guarantees that the associated parameter configuration will provide good generalization. Of course, model selection (e.g., by cross-validation) will partly correct this issue, but if the number of good generalization configurations is very small in comparison to good training configurations, as seems to be the case in practice, then it is likely that the training procedure will not find any of them. But, as we will see in this paper, it appears that the type of unsupervised initialization discussed here can help to select basins of attraction (for the supervised fine-tuning optimization phase) from which learning good solutions is easier both from the point of view of the training set and of a test set.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". But now we know that they are just logistic regression layers stacked on top of each other and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. In light of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since the objective function is not convex, we still face the problem of local minima, although people have devised other techniques to avoid this problem.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and adjust them wisely. Having said that, although NNs are not an active research area in machine learning, they still have wide applications in engineering fields such as control.<br />
<br />
===Business Applications of Neural Networks===<br />
<br />
Neural networks are increasingly being used in real-world business applications and, in some cases, such as fraud detection, they have already become the method of choice. Their use for risk assessment is also growing and they have been employed to visualize complex databases for marketing segmentation. This method covers a wide range of business interests — from finance management, through forecasting, to production. The combination of statistical, neural and fuzzy methods now enables direct quantitative studies to be carried out without the need for rocket-science expertise.<br />
<br />
* On the Use of Neural Networks for Analysis Travel Preference Data <br />
* Extracting Rules Concerning Market Segmentation from Artificial Neural Networks <br />
* Characterization and Segmenting the Business-to-Consumer E-Commerce Market Using Neural Networks<br />
* A Neurofuzzy Model for Predicting Business Bankruptcy <br />
* Neural Networks for Analysis of Financial Statements <br />
* Developments in Accurate Consumer Risk Assessment Technology <br />
* Strategies for Exploiting Neural Networks in Retail Finance <br />
* Novel Techniques for Profiling and Fraud Detection in Mobile Telecommunications<br />
* Detecting Payment Card Fraud with Neural Networks<br />
* Money Laundering Detection with a Neural-Network <br />
* Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=7366stat841f102010-10-25T21:24:19Z<p>Hclam: /* Multi-Class Logistic Regression & Perceptron - October 19, 2010 */</p>
<hr />
<div>==[[Proposal Fall 2010]] ==<br />
==[[statf10841Scribe|Editor sign up]] ==<br />
{{Cleanup|date=October 8 2010|reason=Provide a summary for each topic here.}}<br />
== Summary ==<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
=== Principal Component Analysis ===<br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA projects the data onto a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data, so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation of our data is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of our data whilst capturing as much of the variation in our data as possible.<br />
<br />
==[[f10_Stat841_digest |Digest ]] ==<br />
<br />
== ''' Reference Textbook''' ==<br />
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
== ''' Classification - September 21, 2010''' ==<br />
<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction] (feature extraction or manifold learning). Note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.<br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there are usually too many data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found [http://www.e-knowledge.ca/quotes.php?topic=Knowledge here].<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers <br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,ith</math> training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\, ith</math> training input, <math>\,Y_{i} \in \mathcal{Y} </math>, can take a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
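<br />
The definition of <math>\,\hat{L}_{n}</math> above can be checked on a small example. The following is only a minimal sketch: the rule <code>h</code>, the 8 cm threshold, and the four training pairs are hypothetical and are not taken from the course material.<br />
<br />
```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(X) != len(Y) or len(X) == 0:
        raise ValueError("X and Y must be non-empty and of equal length")
    # Each misclassified input contributes 1 to the sum of indicators.
    mistakes = sum(1 for x_i, y_i in zip(X, Y) if h(x_i) != y_i)
    return mistakes / len(X)


# Hypothetical rule: call a fruit an "orange" if its diameter exceeds 8 cm.
def h(x):
    return "orange" if x["diameter"] > 8.0 else "apple"


# Hypothetical training data (feature values and true labels).
X = [{"diameter": 7.0}, {"diameter": 9.5}, {"diameter": 8.5}, {"diameter": 7.5}]
Y = ["apple", "orange", "apple", "apple"]

print(empirical_error_rate(h, X, Y))  # one of the four inputs is misclassified
```
<br />
Here only the third input (an apple with a diameter of 8.5) is misclassified, so the indicator sum equals 1 and the rate is 1/4.<br />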
<br />
<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify any new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
<br />
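In practice, the empirical error rate is used to estimate the true error rate, whose value cannot be known exactly because the parameter values of the underlying process are unknown and can only be estimated using available data. The empirical error rate estimates the true error rate quite well in that, as mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], it is an unbiased estimator of the true error rate. Note that this unbiasedness holds for a rule <math>\,h</math> that is chosen independently of the data on which the error is measured; when <math>\,h</math> is fitted to the same training data, the training error rate tends to underestimate the true error rate.<br />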
<br />
=== Bayes Classifier ===<br />
<br />
{{Cleanup|date=October 14 2010|reason=In response to the previous tag: The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function <math>\,f_y(x)</math> in the equation:<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
The simpler form of the likelihood function seen in the naive Bayes classifier is:<br />
:<math><br />
\begin{align}<br />
f_y(x) = P(X=x|Y=y) = {\prod_{j=1}^{d} P(X_{j}=x_{j}|Y=y)}<br />
\end{align}<br />
</math><br />
The Bayes classifier taught in class was not the naive Bayes classifier. Perhaps a comment should be made about the naive Bayes classifier in the body of the text}}<br />
<br />
<br />
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". It is a special, simpler case of the general Bayes classifier that is defined further below.<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for naive Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests][2].<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,x \in \mathcal{X}</math>, the Bayes classifier assigns the input the label <math>\,y \in \mathcal{Y}</math> for which the input's posterior probability <math>\,P(Y = y|X = x)</math> is the largest over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class <math> y \in \mathcal{Y} </math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h^*)</math> essentially combines the trained model and the decision function <math>\,h^*</math>, and it is used by the Bayes classifier to assign any new data input a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
The Bayes classifier is the optimal classifier in that it attains the least possible true probability of misclassification, i.e., for any classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that the components which make up the Bayes classifier's model, such as the likelihood and the prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. In addition, estimating the likelihood, the prior probability, and the evidence can be computationally expensive, which also makes some other classifiers more favorable than the Bayes classifier.<br />
<br />
A detailed proof of the optimality theorem above is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h^*</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h^*)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
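<br />
Approach 3 can be sketched in a few lines for the simple situation in which every feature is discrete, so that <math>\,P(X=x|Y=y)</math> may be estimated by relative frequencies. This is an assumption made only for illustration; the function names <code>fit_plugin_bayes</code>, <code>r_hat</code> and <code>h_star</code> and the toy data are hypothetical.<br />
<br />
```python
from collections import Counter


def fit_plugin_bayes(X, Y):
    """Estimate P(X=x|Y=y) by relative frequencies and P(Y=1) by the mean of Y."""
    n = len(Y)
    prior1 = sum(Y) / n                    # estimate of P(Y=1): (1/n) * sum of Y_i
    counts = {0: Counter(), 1: Counter()}  # how often each feature vector occurs per class
    for x_i, y_i in zip(X, Y):
        counts[y_i][tuple(x_i)] += 1
    n_y = {y: sum(c.values()) for y, c in counts.items()}

    def r_hat(x):
        """Estimated posterior P(Y=1 | X=x), obtained from the Bayes formula."""
        f1 = counts[1][tuple(x)] / n_y[1] if n_y[1] else 0.0
        f0 = counts[0][tuple(x)] / n_y[0] if n_y[0] else 0.0
        evidence = f1 * prior1 + f0 * (1 - prior1)
        if evidence == 0:
            raise ValueError("x was never observed; the estimate is undefined")
        return f1 * prior1 / evidence

    def h_star(x):
        """Classification rule: label 1 if the estimated posterior exceeds 1/2."""
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h_star


# Hypothetical training data with two binary features.
X = [(1, 0), (1, 0), (0, 1), (1, 0), (0, 1), (0, 1)]
Y = [1, 1, 1, 0, 0, 0]
r_hat, h_star = fit_plugin_bayes(X, Y)
print(r_hat((1, 0)), h_star((1, 0)), h_star((0, 1)))
```
<br />
With continuous features, the relative frequencies would have to be replaced by a density estimate, for example the Gaussian densities used by LDA and QDA.<br />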
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
'''Theorem'''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
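<br />
As a rough illustration of the theorem, the sketch below turns given values of <math>\,f_i(x)</math> and <math>\,\pi_i</math> into posterior probabilities and picks the class with the largest one. The function name <code>bayes_classify</code> and the numbers for the three classes are hypothetical.<br />
<br />
```python
def bayes_classify(likelihoods, priors):
    """Return (most probable class, posteriors) given f_i(x) and pi_i for each class i."""
    # Evidence P(X=x): sum over all classes of f_i(x) * pi_i.
    evidence = sum(likelihoods[i] * priors[i] for i in priors)
    if evidence <= 0:
        raise ValueError("evidence must be positive")
    posteriors = {i: likelihoods[i] * priors[i] / evidence for i in priors}
    best = max(posteriors, key=posteriors.get)  # arg max over the classes
    return best, posteriors


# Hypothetical values of f_i(x) and pi_i for k = 3 classes.
likelihoods = {1: 0.10, 2: 0.30, 3: 0.05}
priors = {1: 0.5, 2: 0.2, 3: 0.3}
label, post = bayes_classify(likelihoods, priors)
print(label, post)
```
<br />
Class 1 has the largest prior in this example, yet class 2 is chosen because its product <math>\,f_2(x)\pi_2 = 0.06</math> is the largest of the three.<br />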
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course. ''Note: these are the known y values in the training data.'' <br />
<br />
These known data are summarized in the following tables:<br />
<br />
:[[File:裁剪.jpg]]<br />
{{Cleanup|date=September 2010|reason=this graph is not complete, the reason is that it should be in consistent with the computation below.}}<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}<\frac{1}{2}.</math><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
<br />
The Bayesian view of probability states that any event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how believable it is that event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the surface the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior probabilities and the class-conditional densities are not known. Under LDA, one gets around this problem by assuming that both classes have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distributions] and that the two classes share the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> to <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of <math>\,D(h^*)</math> is as follows:<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out alike terms and factoring).<br />
<br />
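It is easy to see that, under LDA, the Bayes classifier's decision boundary <math>\,D(h^*)</math> has the form <math>\,a^\top x+b=0</math>, with <math>\,a = \Sigma^{-1}(\mu_1-\mu_0)</math> and <math>\,b = \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left(\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0\right)</math>, and it is linear in <math>\,x</math>. This is where the word ''linear'' in linear discriminant analysis comes from.<br />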
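<br />
The last line of the derivation can be turned into a small numerical sketch. Reading the coefficient of <math>\,x</math> and the constant term off that line gives a vector <math>\,\Sigma^{-1}(\mu_1-\mu_0)</math> and a scalar <math>\,\log(\frac{\pi_1}{\pi_0})-\frac{1}{2}(\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0)</math>, called <code>a</code> and <code>b</code> below. The function name <code>lda_boundary</code> and the parameter values are hypothetical, and NumPy is assumed to be available.<br />
<br />
```python
import numpy as np


def lda_boundary(mu0, mu1, sigma, pi0, pi1):
    """Return (a, b) such that the LDA decision boundary is {x : a @ x + b = 0}."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    sigma_inv_mu0 = np.linalg.solve(sigma, mu0)  # Sigma^{-1} mu_0
    sigma_inv_mu1 = np.linalg.solve(sigma, mu1)  # Sigma^{-1} mu_1
    a = sigma_inv_mu1 - sigma_inv_mu0            # Sigma^{-1} (mu_1 - mu_0)
    b = np.log(pi1 / pi0) - 0.5 * (mu1 @ sigma_inv_mu1 - mu0 @ sigma_inv_mu0)
    return a, b


# Hypothetical parameters for two classes in d = 2 dimensions.
mu0, mu1 = [0.0, 0.0], [2.0, 2.0]
sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
a, b = lda_boundary(mu0, mu1, sigma, pi0=0.5, pi1=0.5)

midpoint = (np.array(mu0) + np.array(mu1)) / 2
print(a, b, a @ midpoint + b)  # with equal priors the midpoint lies on the boundary
```
<br />
A point <math>\,x</math> for which the expression <code>a @ x + b</code> is positive is assigned to class 1, and a point for which it is negative is assigned to class 0.<br />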
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1}<br />
\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0</math>. In addition, for any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> both have the same number of data points, so that their estimated prior probabilities are equal, then the term <math>\,\log(\frac{\pi_m}{\pi_n})</math> vanishes and the resulting decision boundary passes through the point lying exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
According to [http://www.lsv.uni-saarland.de/Vorlesung/Digital_Signal_Processing/Summer06/dsp06_chap9.pdf this link], some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that the data in each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
<br />
The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-classes case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows: <br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math> (by cancellation)<br />
:<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)</math> (by taking the log of both sides)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0</math> (by expanding out)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0</math> <br />
<br />
It is easy to see that, under QDA, the decision boundary <math>\,D(h^*)</math> has the form <math>\,x^\top A x+b^\top x+c=0</math>, where <math>\,A</math> is a matrix, <math>\,b</math> is a vector and <math>\,c</math> is a scalar, and it is quadratic in <math>\,x</math>. This is where the word ''quadratic'' in quadratic discriminant analysis comes from.<br />
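<br />
The quadratic, linear and constant parts of the boundary can be read off the last line of the derivation: the quadratic part is <math>\,-\frac{1}{2}(\Sigma_1^{-1}-\Sigma_0^{-1})</math>, the linear part is <math>\,\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0</math>, and the remaining terms are constant. The following is a sketch only; the function name <code>qda_boundary</code>, the names <code>A</code>, <code>b</code>, <code>c</code> and the parameter values are hypothetical, and NumPy is assumed to be available.<br />
<br />
```python
import numpy as np


def qda_boundary(mu0, mu1, sigma0, sigma1, pi0, pi1):
    """Return (A, b, c) with decision boundary {x : x @ A @ x + b @ x + c = 0}."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    inv0, inv1 = np.linalg.inv(sigma0), np.linalg.inv(sigma1)
    A = -0.5 * (inv1 - inv0)                 # quadratic part
    b = inv1 @ mu1 - inv0 @ mu0              # linear part
    _, logdet0 = np.linalg.slogdet(sigma0)   # log|Sigma_0|
    _, logdet1 = np.linalg.slogdet(sigma1)   # log|Sigma_1|
    c = (np.log(pi1 / pi0) - 0.5 * (logdet1 - logdet0)
         - 0.5 * (mu1 @ inv1 @ mu1 - mu0 @ inv0 @ mu0))
    return A, b, c


# Hypothetical parameters; when both classes share one covariance matrix,
# the quadratic part vanishes and the boundary reduces to the linear LDA boundary.
sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
A, b, c = qda_boundary([0.0, 0.0], [2.0, 2.0], sigma, sigma, 0.5, 0.5)
print(np.allclose(A, 0.0))
```
<br />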
<br />
As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0</math>.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier's rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math> <br />
where, <br />
* In the case of LDA, which assumes that a common covariance matrix is shared by all classes, <math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is linear in <math>\,x</math>.<br />
<br />
* In the case of QDA, which assumes that each class has its own covariance matrix, <math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is quadratic in <math>\,x</math>.<br />
<br />
<br />
'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]<br />
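<br />
The two discriminant functions of the theorem can be sketched as follows. The function names, the three class means, the priors and the covariance matrices are hypothetical, the classes are indexed from 0 rather than from 1, and NumPy is assumed to be available.<br />
<br />
```python
import numpy as np


def delta_lda(x, mu_k, sigma, pi_k):
    """LDA discriminant: x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log(pi_k)."""
    s_inv_mu = np.linalg.solve(sigma, mu_k)
    return x @ s_inv_mu - 0.5 * mu_k @ s_inv_mu + np.log(pi_k)


def delta_qda(x, mu_k, sigma_k, pi_k):
    """QDA discriminant: -0.5 log|S_k| - 0.5 (x-mu_k)' S_k^-1 (x-mu_k) + log(pi_k)."""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(sigma_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(sigma_k, diff) + np.log(pi_k)


def classify(x, mus, sigmas, pis, delta):
    """Return the class index k (0-based here) that maximizes delta_k(x)."""
    scores = [delta(x, m, s, p) for m, s, p in zip(mus, sigmas, pis)]
    return int(np.argmax(scores))


# Hypothetical parameters for K = 3 classes in d = 2 dimensions.
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
pis = [0.5, 0.25, 0.25]
shared = np.eye(2)                                   # common covariance for LDA
own = [np.eye(2), 2.0 * np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]

x = np.array([2.5, 0.2])
print(classify(x, mus, [shared] * 3, pis, delta_lda))
print(classify(x, mus, own, pis, delta_qda))
```
<br />
When every class is given the same covariance matrix, the two discriminants differ only by terms that do not depend on <math>\,k</math>, so both rules select the same class.<br />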
<br />
===In practice===<br />
In practice, the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their maximum likelihood estimates from the sample in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
Common covariance, denoted <math>\Sigma</math>, is defined as the weighted average of the covariance for each class. <br />
<br />
In the case where we need a common covariance matrix, we get the estimate using the following equation:<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{n-k} </math><br />
<br />
where <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and<br />
<math>\,k</math> is the number of classes.<br />
<br />
See the details about the [http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices estimation of covariance matrices].<br />
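<br />
As a minimal sketch of the estimates above, the following Matlab code computes the prior, mean and covariance estimates for each class, and the common covariance, from a small made-up data set. The data and all variable names (<code>X</code>, <code>y</code>, <code>pihat</code>, <code>muhat</code>, <code>Sigmahat</code>, <code>Sigmapool</code>) are assumptions made for illustration; rows of <code>X</code> are observations, and <code>K</code> plays the role of the number of classes <math>\,k</math> in the formula above.<br />
<br />
 % Toy data (hypothetical): rows are observations, y holds class labels 1..K<br />
 X = [1 2; 2 1; 2 3; 6 5; 7 7; 8 6; 7 5];<br />
 y = [1; 1; 1; 2; 2; 2; 2];<br />
 [n, d] = size(X);<br />
 K = max(y);<br />
 pihat = zeros(K, 1);<br />
 muhat = zeros(K, d);<br />
 Sigmahat = zeros(d, d, K);<br />
 Sigmapool = zeros(d, d);<br />
 for k = 1:K<br />
     Xk = X(y == k, :);                      % points of class k<br />
     nk = size(Xk, 1);<br />
     pihat(k) = nk / n;                      % prior estimate n_k / n<br />
     muhat(k, :) = mean(Xk, 1);              % class mean<br />
     Xc = Xk - repmat(muhat(k, :), nk, 1);   % centered class data<br />
     Sigmahat(:, :, k) = (Xc' * Xc) / nk;    % ML covariance (divides by n_k)<br />
     Sigmapool = Sigmapool + nk * Sigmahat(:, :, k);<br />
 end<br />
 Sigmapool = Sigmapool / (n - K);            % common covariance for LDA<br />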
<br />
===Computation For QDA And LDA===<br />
<br />
First, let us consider QDA, and examine each of the following two cases.<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
<math>\, \Sigma_k = I </math> for every class <math>\,k</math> implies that our data is spherical. This means that the data of each class <math>\,k</math> is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{-1}{2}log(|I|)</math>, is zero since <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can compute the squared distance between a point and each center, halve it, and subtract the log of the prior, <math>\,log(\pi_k)</math>. The class for which this adjusted distance is smallest maximizes <math>\,\delta_k</math>; when all priors are equal, this is simply the class with the nearest center. According to the theorem, we then assign the point to that class <math>\,k</math>. <br />
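<br />
As an illustrative worked example (the numbers are made up for this sketch, and <math>\,\log</math> is the natural logarithm), take two classes in two dimensions with <math>\,\Sigma_1=\Sigma_2=I</math>, <math>\,\mu_1=(0,0)^\top</math>, <math>\,\mu_2=(2,0)^\top</math>, priors <math>\,\pi_1=0.8</math> and <math>\,\pi_2=0.2</math>, and a new point <math>\,x=(1.2,0)^\top</math>. Then<br />
<br />
:<math>\begin{align}<br />
\delta_1(x) &= -\frac{1}{2}(1.2^2+0^2) + \log(0.8) \approx -0.720 - 0.223 = -0.943\\<br />
\delta_2(x) &= -\frac{1}{2}((1.2-2)^2+0^2) + \log(0.2) \approx -0.320 - 1.609 = -1.929<br />
\end{align}<br />
</math><br />
<br />
so <math>\,x</math> is assigned to class 1 even though it is closer to <math>\,\mu_2</math>: the larger prior of class 1 outweighs the difference in distance. With equal priors the same point would be assigned to class 2.<br />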
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top </math> (In general, when a matrix <math>\,X</math> has the singular value decomposition <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are eigenvectors of <math>\,X^\top X</math>. <br />
If <math>\, X</math> is symmetric and positive semi-definite, we can take <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric and positive semi-definite because it is a covariance matrix) and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top </math> (since <math>\,U_k</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S_k^{-\frac{1}{2}}U_k^\top x </math> and <math>\, S_k^{-\frac{1}{2}}U_k^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>.<br />
<br />
A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math> where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.<br />
<br />
It is now possible to compute the quadratic term of <math>\,\delta_k</math> from <math>\,x^*</math> and <math>\,\mu_k^*</math>, treating them as in Case 1 above.<br />
<br />
Note that the transformation only takes care of the quadratic term. Since <math>\,\Sigma_k</math> generally differs from class to class, we also need to compute <math>\, log{|\Sigma_k|}</math> for each class, and only then do we obtain <math> \,\delta_k </math> for QDA.<br />
<br />
If all classes share the same covariance matrix, a single transformation serves all of them, the term <math>\, log{|\Sigma_k|}</math> is the same for every class and can be dropped, and the method reduces to LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we still use this method? A single transformation cannot be used, because choosing the transformation of class A for a data point would amount to assuming in advance that the point belongs to class A. The method still works, however, if each class uses its own transformation: given a data point <math>\,x</math>, transform <math>\,x</math> and <math>\,\mu_k</math> with the transformation of class <math>\,k</math> to obtain <math>\,\delta_k</math>, repeat this for every class, and assign <math>\,x</math> to the class with the largest <math>\,\delta_k</math>. For example, with two classes we compute <math>\,\delta_1</math> and <math>\,\delta_2</math> and compare them. This is the approach used for the QDA computation in assignment 1, where the three covariance matrices were different.<br />
<br />
In summary, to apply QDA on a data set <math>\,X</math>, in the general case where <math>\, \Sigma_k \ne I </math> for each class <math>\,k</math>, one can proceed as follows:<br />
<br />
:: Step 1: For each class <math>\,k</math>, apply singular value decomposition to the estimated covariance matrix <math>\,\Sigma_k</math> to obtain <math>\,S_k</math> and <math>\,U_k</math>.<br />
<br />
:: Step 2: For each data point <math>\,x</math> and each class <math>\,k</math>, transform <math>\,x</math> to <math>\,x_k^* = S_k^{-\frac{1}{2}}U_k^\top x</math>, and transform the center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math> and each class <math>\,k</math>, find the squared Euclidean distance between the transformed data point <math>\,x_k^*</math> and the transformed center <math>\,\mu_k^*</math>, compute <math>\,\delta_k = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}\|x_k^*-\mu_k^*\|^2 + \log(\pi_k)</math>, and assign <math>\,x</math> to the class with the largest <math>\,\delta_k</math>. When all classes have the same prior and the same <math>\,|\Sigma_k|</math>, this is the class whose transformed center is nearest.<br />
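<br />
The steps above can be sketched in Matlab as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the class parameters (<code>mu</code>, <code>Sigma</code>, <code>prior</code>) and the point <code>x</code> are made-up values, and <code>eig</code> is applied to each class covariance matrix, which for a symmetric positive definite matrix yields the same factors as the singular value decomposition. The terms <math>\,-\frac{1}{2}\log|\Sigma_k|</math> and <math>\,\log(\pi_k)</math> from the formula for <math>\,\delta_k</math> are included so that the comparison across classes is complete.<br />
<br />
 % Hypothetical class parameters for K = 2 classes in d = 2 dimensions<br />
 mu = [0 0; 3 1];                         % row k is the mean of class k<br />
 Sigma = zeros(2, 2, 2);<br />
 Sigma(:, :, 1) = [2 0.5; 0.5 1];<br />
 Sigma(:, :, 2) = [1 -0.3; -0.3 0.5];<br />
 prior = [0.5; 0.5];<br />
 x = [1.5; 0.5];                          % point to classify (column vector)<br />
 K = size(mu, 1);<br />
 delta = zeros(K, 1);<br />
 for k = 1:K<br />
     [U, S] = eig(Sigma(:, :, k));        % Sigma_k = U*S*U' with U orthogonal<br />
     T = diag(1 ./ sqrt(diag(S))) * U';   % whitening map S^(-1/2) * U'<br />
     xs = T * x;                          % transformed point<br />
     ms = T * mu(k, :)';                  % transformed center<br />
     dist2 = sum((xs - ms).^2);           % squared Euclidean distance<br />
     logdet = sum(log(diag(S)));          % log|Sigma_k| from the eigenvalues<br />
     delta(k) = -0.5 * logdet - 0.5 * dist2 + log(prior(k));<br />
 end<br />
 [dmax, label] = max(delta);              % label is the predicted class<br />
<br />
Note that the point <code>x</code> is transformed once per class, so no class membership has to be assumed ahead of time.<br />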
<br />
<br />
Now, let us consider LDA. <br />
Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common covariance matrix across the classes, so we perform the singular value decomposition once, as opposed to once per class.<br />
<br />
To apply LDA on a data set <math>\,X</math>, one can proceed as follows:<br />
<br />
:: Step 1: Apply singular value decomposition to the estimated common covariance matrix <math>\,\Sigma</math> to obtain <math>\,S</math> and <math>\,U</math>.<br />
<br />
:: Step 2: For each <math>\,x \in X</math>, transform <math>\,x</math> to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, and transform each center <math>\,\mu_k</math> to <math>\,\mu_k^* = S^{-\frac{1}{2}}U^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu_k^*</math> of each class, and assign <math>\,x</math> to the class for which <math>\,\frac{1}{2}\|x^*-\mu_k^*\|^2 - \log(\pi_k)</math> is the least. With equal priors, this is the class whose transformed center is nearest to <math>\,x^*</math>.<br />
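<br />
The LDA steps above can be sketched in the same way. In this minimal sketch the parameters (<code>mu</code>, <code>Sigma</code>, <code>prior</code>) and the point <code>x</code> are made-up values, and <code>eig</code> stands in for the singular value decomposition of the symmetric common covariance matrix. The sketch also evaluates the linear discriminant <math>\,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log(\pi_k)</math> directly; the two scores differ only by a term that does not depend on <math>\,k</math>, so they select the same class.<br />
<br />
 % Hypothetical parameters: K = 3 classes sharing one covariance matrix<br />
 mu = [0 0; 3 1; 1 4];                    % row k is the mean of class k<br />
 Sigma = [2 0.5; 0.5 1];                  % common covariance<br />
 prior = [0.3; 0.3; 0.4];<br />
 x = [1.5; 1.0];                          % point to classify (column vector)<br />
 [U, S] = eig(Sigma);                     % done once for all classes<br />
 T = diag(1 ./ sqrt(diag(S))) * U';       % whitening map S^(-1/2) * U'<br />
 xs = T * x;<br />
 K = size(mu, 1);<br />
 score = zeros(K, 1);                     % whitened-distance rule<br />
 delta = zeros(K, 1);                     % linear discriminant from the text<br />
 for k = 1:K<br />
     m = mu(k, :)';<br />
     ms = T * m;<br />
     score(k) = -0.5 * sum((xs - ms).^2) + log(prior(k));<br />
     delta(k) = x' * (Sigma \ m) - 0.5 * m' * (Sigma \ m) + log(prior(k));<br />
 end<br />
 [smax, labelA] = max(score);<br />
 [dmax, labelB] = max(delta);             % labelA and labelB agree<br />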
<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In actual data scenarios, QDA often fits the data better than LDA because QDA does not assume that the covariance matrix for each class is identical, as LDA assumes. However, QDA still assumes that the class conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method which does not have the Gaussian distribution assumption.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
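<br />
To make the two counts concrete, the short sketch below evaluates both formulas for a few dimensions. The chosen values of <code>d</code> and <code>K</code> are assumptions for illustration only.<br />
<br />
 d = [2 10 64];                            % example dimensions (assumed values)<br />
 K = 2;                                    % number of classes<br />
 nLDA = (K - 1) * (d + 1);                 % gives 3 11 65<br />
 nQDA = (K - 1) * (d .* (d + 3) / 2 + 1);  % gives 6 66 2145<br />
<br />
The LDA count grows linearly in <math>\,d</math>, while the QDA count grows quadratically.<br />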
<br />
== Trick: Using LDA to do QDA - September 28, 2010==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of our original features. We then do LDA on the new higher-dimensional data. The linear decision boundary that LDA finds in the higher-dimensional space corresponds to a quadratic boundary in the original space.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameters of this second symmetric covariance matrix make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
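where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. Here we take <math>\,v</math> to be a diagonal matrix with diagonal entries <math>\,v_1,...,v_d</math>, so that <math>\,x^Tvx = v_1{x_1}^2+...+v_d{x_d}^2</math> contains no cross terms.<br />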
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix to the quadratic case by appending another <math>D \times n</math> matrix containing the squared entries of the original matrix, to the cubic case by appending the cubed entries, or even use a different function altogether, such as <math>\,sin(x)</math>. Note, however, that this is not the same as doing QDA. If we apply QDA directly to the original data, the resulting decision boundary will generally be different. Here we are looking for a nonlinear boundary that separates the classes better than a line does, so the method is better described as nonlinear LDA.<br />
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
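<br />
:The loop above can also be written without explicit indices. A minimal sketch, using a small made-up matrix in place of <code>sample</code> so that it runs on its own:<br />
<br />
 sample = [1 2; 3 4; 5 6];          % stand-in for the 400-by-2 PCA scores<br />
 X_star = [sample, sample.^2];      % append the element-wise squares as two new columns<br />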
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA to the same data set and reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the classes of the points 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
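<br />
:The error rate can also be computed directly from the two label vectors. The sketch below uses short made-up vectors in place of the 400 labels so that it runs on its own; for the 2_3 data the same expression gives 31/400 = 0.0775.<br />
<br />
 group = [1; 1; 1; 2; 2; 2; 2; 2];   % true labels (made-up)<br />
 class = [1; 1; 2; 2; 2; 2; 1; 2];   % predicted labels (made-up)<br />
 ncorrect = sum(class == group);      % 6 correct<br />
 err = mean(class ~= group);          % 2/8 = 0.25<br />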
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA that do not lie on the correct side of the line produced by LDA.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in SCORES,<br />
 % the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
    latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
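Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code>, possibly up to the sign of individual columns.<br />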
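<br />
The key lines of the <code>princomp</code> listing can also be tried on their own. The sketch below is an illustration on made-up data, not part of the original function: it repeats the centering and economy-size SVD, and compares the resulting <code>latent</code> values with the eigenvalues of the sample covariance matrix returned by <code>cov</code>. The variable names <code>ev</code> and <code>gap</code> are assumptions.<br />
<br />
 % Made-up data: m = 6 observations (rows) of n = 3 variables (columns)<br />
 x = [2 0 1; 3 1 0; 4 3 2; 5 2 4; 6 5 3; 7 4 6];<br />
 m = size(x, 1);<br />
 centerx = x - repmat(mean(x), m, 1);          % subtract column means<br />
 [U, D, V] = svd(centerx ./ sqrt(m - 1), 0);   % economy-size SVD, as in princomp<br />
 latent = diag(D).^2;                          % variances along the components<br />
 ev = sort(eig(cov(x)), 'descend');            % eigenvalues of the sample covariance<br />
 gap = max(abs(latent - ev));                  % close to zero, up to rounding<br />
 score = centerx * V;                          % data in principal component coordinates<br />
<br />
Dividing by <code>sqrt(m-1)</code> before the SVD is what makes the squared singular values equal to the covariance eigenvalues.<br />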
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Principal Component Analysis - September 30, 2010==<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br />
<br /><br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA takes a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data and projects the data onto these directions to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of our data while capturing as much of the variation in our data as possible. <br />
<br />
<br />
Furthermore, if one considers the lower dimensional representation produced by PCA as a least squares fit of our original data, then it can also be easily shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know which dimensions PCA selects to be the most informative dimensions in order to create the lower-dimensional representation. <br />
<br />
<br />
Suppose <math>\,X</math> is our data matrix containing <math>\,d</math>-dimensional data. The idea behind PCA is to apply [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] to <math>\,X</math> in order to replace the <math>\,d</math> dimensions of <math>\,X</math> by a smaller number of directions that capture as much of the [http://en.wikipedia.org/wiki/Variance variance] in <math>\,X</math> as possible. First, through the application of singular value decomposition to <math>\,X</math>, PCA obtains all of our data's directions of variation. These directions are ordered from left to right, with the leftmost directions capturing the most variation in our data and the rightmost directions capturing the least. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space. <br />
<br />
<br />
By applying singular value decomposition to <math>\,X</math>, which is a <math>\,d \times n</math> matrix whose columns are the data points, <math>\,X</math> is decomposed as <math>\,X = U\Sigma V^T \,</math>. The <math>\,d</math> columns of <math>\,U</math> are the [http://en.wikipedia.org/wiki/Eigenvector eigenvectors] of <math>\,XX^T \,</math>.<br />
The <math>\,n</math> columns of <math>\,V</math> are the eigenvectors of <math>\,X^TX \,</math>. The diagonal values of <math>\,\Sigma</math> are the square roots of the [http://en.wikipedia.org/wiki/Eigenvalue eigenvalues] of <math>\,XX^T \,</math> (the nonzero ones are also eigenvalues of <math>\,X^TX \,</math>), and they correspond to the columns of <math>\,U</math> (also of <math>\,V</math>). <br />
<br />
<br />
We are interested in <math>\,U</math>, whose <math>\,d</math> columns are the <math>\,d</math> directions of variation of our data. Ordered from left to right, the <math>\,ith</math> column of <math>\,U</math> is the <math>\,ith</math> most informative direction of variation of our data. That is, the <math>\,ith</math> column of <math>\,U</math> is the <math>\,ith</math> most effective column in terms of capturing the total variance exhibited by our data. A subset of the columns of <math>\,U</math> is used by PCA to reduce the dimensionality of <math>\,X</math> by projecting <math>\,X</math> onto the columns of this subset. In practice, when we apply PCA to <math>\,X</math> to reduce the dimensionality of <math>\,X</math> from <math>\,d</math> to <math>\,k</math>, where <math>k < d\,</math>, we would proceed as follows:<br />
<br />
:: Step 1: Center <math>\,X</math> so that it would have zero mean.<br />
<br />
:: Step 2: Apply singular value decomposition to <math>\,X</math> to obtain <math>\,U</math>.<br />
<br />
:: Step 3: Suppose we denote the resulting <math>\,k</math>-dimensional representation of <math>\,X</math> by <math>\,Y</math>. Then, <math>\,Y</math> is obtained as <math>\,Y = U_k^TX</math>. Here, <math>\,U_k</math> consists of the first (leftmost) <math>\,k</math> columns of <math>\,U</math> that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.<br />
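<br />
The three steps above can be sketched in Matlab as follows. This is a minimal sketch on made-up data; the matrix <code>X</code>, the target dimension <code>k</code> and the names <code>Xc</code>, <code>Uk</code> and <code>Y</code> are assumptions for illustration. As in the text, each column of <code>X</code> is one data point, and the projection is applied to the centered data from Step 1.<br />
<br />
 % Made-up data: d = 3 dimensions, n = 6 points stored as columns<br />
 X = [2 3 4 5 6 7; 0 1 3 2 5 4; 1 0 2 4 3 6];<br />
 n = size(X, 2);<br />
 k = 2;                                   % target dimension<br />
 Xc = X - repmat(mean(X, 2), 1, n);       % Step 1: center each row (variable)<br />
 [U, S, V] = svd(Xc);                     % Step 2: singular values sorted in decreasing order<br />
 Uk = U(:, 1:k);                          % first k directions of variation<br />
 Y = Uk' * Xc;                            % Step 3: k-by-n low-dimensional representation<br />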
<br />
<br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' principal components. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:Handwritten 3s.gif]]<br />
<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,130}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We apply PCA to the data and compare the resulting representations of the labeled images. This is an example of classification.<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[Image:23plotPCA.jpg]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
{{Cleanup|date=October 2010|reason=I think English of this section must be improved}}<br />
We want to find the direction of maximum variation. Let <math>\begin{align}\textbf{w}\end{align}</math> be an arbitrary direction, <math>\begin{align}\textbf{x}\end{align}</math> a data point and <math>\begin{align}\displaystyle u\end{align}</math> the length of the projection of <math>\begin{align}\textbf{x}\end{align}</math> in direction <math>\begin{align}\textbf{w}\end{align}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\begin{align}\textbf{w}\end{align}</math> is the same as <math>\begin{align}c\textbf{w}\end{align}</math>, for any nonzero scalar <math>c</math>, so without loss of generality, we assume that: <br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}.<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables. Our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}. <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} .</math><br />
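The identity above can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set and an arbitrary unit direction (neither is from the lecture). It computes the sample variance of the projections directly and compares it with the quadratic form <math>\textbf{w}^T s \textbf{w}</math>.

```python
# Hypothetical toy data: five 2-dimensional points (not from the lecture).
data = [(2.0, 1.0), (3.0, 4.0), (5.0, 4.0), (6.0, 7.0), (9.0, 9.0)]
n = len(data)

# Sample mean of each coordinate.
mean = [sum(p[j] for p in data) / n for j in range(2)]

# Sample covariance matrix s (divisor n - 1).
s = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in data) / (n - 1)
      for j in range(2)] for i in range(2)]

# An arbitrary unit-length direction w (0.6^2 + 0.8^2 = 1).
w = (0.6, 0.8)

# Projections u = w^T x and their sample variance.
u = [w[0] * p[0] + w[1] * p[1] for p in data]
u_mean = sum(u) / n
var_u = sum((ui - u_mean) ** 2 for ui in u) / (n - 1)

# Quadratic form w^T s w.
quad = sum(w[i] * s[i][j] * w[j] for i in range(2) for j in range(2))

print(var_u, quad)  # the two values agree up to rounding error
```

Any other direction could be substituted for <code>w</code>; the agreement does not depend on the particular choice.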
<br /><br /><br />
The above is the variance of <math>\begin{align}\displaystyle u \end{align}</math> formed by the weight vector <math>\begin{align}\textbf{w} \end{align}</math>. The first principal component is the vector <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the function. Without a constraint this problem has no solution: scaling <math>\begin{align}\textbf{w} \end{align}</math> by a factor <math>c</math> scales <math>\textbf{w}^T s \textbf{w}</math> by <math>c^2</math>, so the objective is unbounded. Since we are only interested in the direction of greatest variability, we restrict <math>\begin{align}\textbf{w} \end{align}</math> to unit length, and the problem becomes<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
s.t. <math>\textbf{w}^T \textbf{w} = 1.</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|.<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded on the unit sphere, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. Applying the Lagrange multiplier method to this example, the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
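As a sanity check on this example, here is a small sketch in plain Python. It uses the Lagrangian <math>L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> exactly as written above; the helper names <code>grad_L</code> and <code>candidates</code> are illustrative only. It verifies that all three partial derivatives vanish at both stationary points and compares the result with a brute-force search along the unit circle.

```python
import math

def f(x, y):
    return x - y

def grad_L(x, y, lam):
    """Partial derivatives of L = x - y - lam * (x^2 + y^2 - 1)."""
    return (1 - 2 * lam * x,          # dL/dx
            -1 - 2 * lam * y,         # dL/dy
            -(x * x + y * y - 1))     # dL/dlambda

r = math.sqrt(2) / 2
# Candidate stationary points (x, y, lambda); lambda = 1 / (2x) follows from dL/dx = 0.
candidates = [(r, -r, 1 / (2 * r)), (-r, r, 1 / (2 * -r))]

for x, y, lam in candidates:
    print((x, y), "grad L =", grad_L(x, y, lam), "f =", f(x, y))

# The constrained maximum is the candidate with the larger value of f.
best = max(candidates, key=lambda c: f(c[0], c[1]))

# Independent check: evaluate f on a fine grid of points on the unit circle.
grid_max = max(f(math.cos(t), math.sin(t))
               for t in (2 * math.pi * k / 3600 for k in range(3600)))
```

The grid search is only a cross-check; it plays no part in the Lagrange multiplier method itself.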
<br />
====Determining '''W''' ====<br />
Returning to the original problem, we form the Lagrangian (writing <math>S</math> for the sample covariance matrix),<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} = 1 </math> and the second term of the Lagrangian is 0, so <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective. <br />
<br />
Setting the derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\lambda</math> to zero gives back exactly this constraint, <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian is a unit vector.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
{{Cleanup|date=October 2010|reason=It is good discussion, what will happen if we don't have distinct eigenvalues and eigenvectors? What does this situation mean? }}<br />
{{Cleanup|date=October 2010|reason=If the eigenvalues are not distinct, I suppose we could still take the leftmost eigenvector by default. Not sure if this is the correct approach, so can anyone please explain further? Thanks }}<br />
{{Cleanup|date=October 2010|reason= As U is the eigenvector of a symetric matrix, is it possible that we have 2 similar eigen vector? }}<br />
<br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
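To make the eigenvalue result concrete, the sketch below finds the first principal component of a small covariance matrix by power iteration. The matrix is a hypothetical example, and power iteration is just one simple way to obtain the top eigenvector; it is not necessarily the method used in the lecture. The code checks that the variance along the resulting direction equals the largest eigenvalue and that the two eigenvalues sum to the trace.

```python
import math

# Hypothetical 2x2 sample covariance matrix (symmetric, positive definite).
S = [[1.0, 1.5],
     [1.5, 3.0]]

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenpair(A, iterations=200):
    """Power iteration: repeatedly apply A and renormalize to unit length."""
    w = normalize([1.0] * len(A))
    for _ in range(iterations):
        w = normalize(matvec(A, w))
    lam = sum(wi * yi for wi, yi in zip(w, matvec(A, w)))  # w^T A w
    return lam, w

lam1, w1 = top_eigenpair(S)       # variance along the first PC, and its direction
lam2 = S[0][0] + S[1][1] - lam1   # the eigenvalues sum to Tr(S)

print("first PC:", w1)
print("variances along PC1 and PC2:", lam1, lam2)
```

For larger matrices a library eigenvalue routine would normally be used instead of this hand-written loop.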
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                 % loads the data matrix X, one image per column<br />
 who<br />
 size(X)<br />
 imagesc(reshape(X(:,1),20,28)')            % display the first noisy image<br />
 colormap gray<br />
 m_X=mean(X,2);                             % mean image<br />
 mm=repmat(m_X,1,size(X,2));                % one copy of the mean per column of X<br />
 XX=X-mm;                                   % centre the data<br />
 [u s v] = svd(XX);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 xHat=xHat+mm;                              % add the mean back<br />
 figure<br />
 imagesc(reshape(xHat(:,1),20,28)') % reconstruction of the same image; the index 1 can be changed to any column of X<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Extraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. leave the data unlabeled and look for groups in it).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized as follows (taken from the Lecture Slides).<br />
<br />
====Algorithm ====<br />
'''Recover basis:''' Calculate <math> XX^T =\sum_{i=1}^{n} x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.<br />
<br />
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times n</math> matrix of encoding of the original data.<br />
<br />
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.<br />
<br />
'''Encode test example:''' <math> y=U^T x </math> where <math> y </math> is a <math>d</math>-dimensional encoding of <math>x</math>.<br />
<br />
'''Reconstruct test example:''' <math>\hat{x}= Uy=UU^Tx </math>.<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.<br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
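The steps of the algorithm can be sketched in a few lines of plain Python. This is an illustration only: the data are made up, <math>d = 1</math> is chosen for simplicity, the points are stored as rows rather than as columns of <math>X</math>, and the top eigenvector is found by power iteration (a hypothetical helper, <code>top_eigenvector</code>) instead of a full eigendecomposition.

```python
import math

# Hypothetical toy data: n = 5 points in D = 2 dimensions (not from the lecture).
points = [(2.0, 1.0), (3.0, 4.0), (5.0, 4.0), (6.0, 7.0), (9.0, 9.0)]
n, D = len(points), 2

# Preprocess: subtract the mean so that the data has zero mean.
mean = [sum(p[j] for p in points) / n for j in range(D)]
X = [[p[j] - mean[j] for j in range(D)] for p in points]  # one row per point

# Recover basis: XX^T = sum_i x_i x_i^T, then its top eigenvector (d = 1).
XXt = [[sum(x[i] * x[j] for x in X) for j in range(D)] for i in range(D)]

def top_eigenvector(A, iterations=100):
    """Power iteration; adequate here because the top eigenvalue is well separated."""
    u = [1.0] * len(A)
    for _ in range(iterations):
        v = [sum(A[i][j] * u[j] for j in range(len(u))) for i in range(len(A))]
        norm = math.sqrt(sum(c * c for c in v))
        u = [c / norm for c in v]
    return u

U = top_eigenvector(XXt)

# Encode: y = U^T x is a single number per point.
Y = [sum(U[j] * x[j] for j in range(D)) for x in X]

# Reconstruct: x_hat = U y, then add the mean back.
X_hat = [[U[j] * y + mean[j] for j in range(D)] for y in Y]

# Total squared reconstruction error over all points.
error = sum((p[j] - xh[j]) ** 2 for p, xh in zip(points, X_hat) for j in range(D))
print("encoding:", Y)
print("reconstruction error:", error)
```

The reconstruction error equals the eigenvalue of <math>XX^T</math> that was discarded, which is why keeping the directions with the largest eigenvalues loses the least information.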
<br />
== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010 ==<br />
<br />
===Sir Ronald A. Fisher===<br />
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here]. <br />
In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
As in PCA, the goal of FDA is to project the data onto a lower-dimensional space. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf here]. PCA is an unsupervised method, so it does not take into account the labels in the data. Suppose we have two clusters that have different labels but are elongated, parallel to each other, and very near to each other. In this case, most of the total variation of the data is along the direction in which both clusters are elongated. If we use PCA in cases like this, both clusters are projected onto the direction of greatest variation of the data and overlap after projection, almost like a single cluster. PCA would therefore mix up two clusters that, in fact, have different labels. What we need to do instead, in cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation of the data, i.e. the direction of least variation. In the 1-dimensional space resulting from such a projection, we would then be able to classify the data effectively, because the two clusters would be perfectly or nearly perfectly separated from each other. This is exactly the idea behind FDA. <br />
<br />
The main difference between FDA and PCA is that, in FDA, in contrast to PCA, we are not interested in retaining as much of the variance of our original data as possible. Rather, in FDA, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction that is most representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
Suppose we have 2-dimensional data. FDA would then attempt to project the data of each class onto a point in such a way that the resulting two points are as far apart from each other as possible. Intuitively, this is the ideal way of separating a pair of classes along a single direction. <br />
<br />
{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).}}<br />
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}<br />
{{Cleanup|date=October 2010|reason=An extension of clustering is subspace clustering in which different subspace are searched through to find the relavant and appropriate dimentions. High dimentional data sets are roughly equiedistant from each other, so feature selection methods are used to remove the irrelavant dimentions. These techniques do not keep the relative distance so PCA is not useful for these applications. It should be noted that subspace clustering localize their search unlike feature selection algorithms.for more information click here[http://portal.acm.org/citation.cfm?id=1007731]}}<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2-class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k-class problem, we want to reduce the data to k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance. That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
{{Cleanup|date=October2010|reason=Anyone please add an example to make the comparison clearer}}<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs as well as or better than some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
===FDA Goals===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
==== Example in R ====<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
 >> library(MASS)  # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
 >> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
: Calculate the singular value decomposition of the centred X (PCA requires the column means to be subtracted first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function (from the MASS package), given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
<br />
FDA projects the data into lower dimensional space, where the distances between the projected means are maximum and the within-class variances are minimum. There are two categories of classification problems:<br />
<br />
1. Two-class problem<br />
<br />
2. Multi-class problem (addressed next lecture)<br />
<br />
=== Two-class problem ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the variance within each class''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances (the sum of the two covariance matrices is itself a valid covariance matrix, satisfying the symmetry and positive semi-definiteness criteria). <br />
<br />
{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}<br />
<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline{x_1}, \underline{x_2}, \cdots, \underline{x_n} \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> where <math>\ z_i </math> is a scalar<br />
<br />
====1. Minimizing within-class variance==== <br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_1\underline{w}) </math><br />
<br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_2\underline{w}) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math><br />
<br> (where <math>\,\Sigma_1</math> and <math>\,\Sigma_2 </math> are the covariance matrices of the 1st and 2nd classes of data, respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-classes covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>.<br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br /> <br />
<math>\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2, </math><br />
<br /><br /><br />
<math>\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1 - \mu_2)^T \underline{w} \underline{w}^T (\mu_1 - \mu_2) \\<br />
&= ((\mu_1 - \mu_2)^T \underline{w})^{T} (\underline{w}^T (\mu_1 - \mu_2))^{T} \\<br />
&= \underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \underline{w} \end{align}</math><br /><br />
<br />
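Note that the last two lines hold because <math>(\mu_1 - \mu_2)^T \underline{w}</math> and <math>\underline{w}^T (\mu_1 - \mu_2)</math> are scalars: each equals its own transpose, so the factors can be rearranged.<br />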
<br />
Let <math>\displaystyle s_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math>, the between-class covariance, then the goal is to <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>.<br />
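The equivalence between the squared distance of the projected means and the quadratic form in <math>s_B</math> can be checked with a few lines of plain Python. This is only a numerical illustration: the means are borrowed from the R example earlier in this section, and the direction <code>w</code> is an arbitrary choice.

```python
mu1 = [1.0, 1.0]   # class means, as in the R example earlier in this section
mu2 = [5.0, 3.0]
w = [0.6, 0.8]     # arbitrary direction, chosen only for illustration

diff = [a - b for a, b in zip(mu1, mu2)]

# Between-class matrix s_B = (mu1 - mu2)(mu1 - mu2)^T.
s_B = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

# Left side: squared distance between the projected means.
proj1 = sum(wi * m for wi, m in zip(w, mu1))
proj2 = sum(wi * m for wi, m in zip(w, mu2))
lhs = (proj1 - proj2) ** 2

# Right side: quadratic form w^T s_B w.
rhs = sum(w[i] * s_B[i][j] * w[j] for i in range(2) for j in range(2))

print(lhs, rhs)  # the two values agree up to rounding error
```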
<br />
===The Objective Function for FDA===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math> or <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math><br />
# <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math> <br />
<br /><br />
So, we construct our objective function as the ratio of the two quantities stated above, which we maximize:<br /><br />
<br /><br />
<math>\displaystyle \max_w \frac{(\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w})} {(\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max_w \frac{(\underline{w}^Ts_B\underline{w})}{(\underline{w}^Ts_w\underline{w})}</math> <br /><br />
One could instead combine the two goals by subtraction. That approach is valid, but it can be shown to require an additional scaling factor, so using the ratio is more convenient.<br />
<br />
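The value of this ratio does not change when <math>\underline{w}</math> is rescaled, so the maximizer is not unique. To remove this ambiguity, we fix the denominator by adding the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> and maximize the numerator. To solve this constrained optimization problem, we form the Lagrangian:<br />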
<br />
<br /><br /><br />
<math>\displaystyle L(\underline{w},\lambda) = \underline{w}^Ts_B\underline{w} - \lambda (\underline{w}^Ts_w\underline{w} -1)</math><br /><br /><br />
<br />
<br /><br />
Then, we set the partial derivative of L with respect to <math>\underline{w}</math> to zero:<br />
<math>\displaystyle \frac{\partial L}{\partial \underline{w}}=2s_B \underline{w} - 2\lambda s_w \underline{w} = 0 </math> <br /><br />
<br />
<math>s_B \underline{w} = \lambda s_w \underline{w}</math><br /><br />
<math>s_w^{-1}s_B \underline{w}= \lambda\underline{w}</math><br /><br /><br />
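This has the form of a generalized eigenvalue problem. Multiplying <math>s_B \underline{w} = \lambda s_w \underline{w}</math> on the left by <math>\underline{w}^T</math> and using the constraint gives <math>\underline{w}^Ts_B\underline{w} = \lambda</math>, so the objective is maximized by taking <math> \underline{w}</math> to be the eigenvector of <math>s_w^{-1}s_B </math> that corresponds to the largest eigenvalue.<br /><br />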
<br />
This solution can be further simplified as follows:<br /><br />
<br />
<math>s_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w} = \lambda\underline{w} </math><br /><br />
<br />
Since <math>(\mu_1 - \mu_2)^T\underline{w}</math> is a scalar, <math>\underline{w} \propto s_w^{-1}(\mu_1 - \mu_2)</math>. <br /><br />
This gives the direction of <math>\underline{w}</math> without an eigenvalue decomposition in the case of the two-class problem.<br />
<br />
Note: In order for <math>{s_w}</math> to have an inverse, it must have full rank. A necessary (though not sufficient) condition for this is that the number of data points is <math>\,\ge</math> the dimensionality of <math>\underline{x_{i}}</math>.<br />
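<br />
As a rough numerical check of the two-class result, the following Python sketch evaluates <math>s_w^{-1}(\mu_1 - \mu_2)</math> for the population parameters used in the Matlab example of the next section and confirms that the result satisfies <math>s_B \underline{w} = \lambda s_w \underline{w}</math>. The helper names <code>solve_2x2</code> and <code>matvec</code> are illustrative only and are not part of the lecture.<br />

```python
# Sketch: Fisher direction for two classes in 2-D, using population parameters.
# Helper names are illustrative; only w ∝ s_w^{-1}(mu_1 - mu_2) comes from the text.

def solve_2x2(a, b):
    """Solve a @ x = b for a 2x2 matrix a (list of rows) by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("s_w is singular, so the direction is not defined")
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def matvec(a, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]

mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]                      # shared class covariance
s_w = [[2 * x for x in row] for row in sigma]         # Sigma_1 + Sigma_2
diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]             # mu_1 - mu_2
s_b = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

w = solve_2x2(s_w, diff)                              # s_w^{-1}(mu_1 - mu_2)

# Check the eigen-relation s_B w = lambda s_w w with lambda = (mu_1 - mu_2)^T w.
lam = diff[0] * w[0] + diff[1] * w[1]
lhs = matvec(s_b, w)
rhs = [lam * x for x in matvec(s_w, w)]
print(w, lam, lhs, rhs)
```

The vector obtained here is the negative of the one computed from <math>\mu_2 - \mu_1</math> in the Matlab code; the sign does not affect the direction of the projection line.<br />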
<br />
===FDA Using Matlab===<br />
Note: ''The following example was not actually mentioned in this lecture''<br />
<br />
We now apply the theory just introduced. Using Matlab, we compute the first principal component and the Fisher Discriminant Analysis projection for samples from two bivariate normal distributions, to show the difference between the two methods. <br />
 % First, we generate the two data sets.<br />
 % First data set X1<br />
 X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);<br />
 % In this case: <br />
 mu_1=[1;1]; <br />
 Sigma_1=[1 1.5; 1.5 3]; <br />
 % where mu_1 and Sigma_1 are the mean and covariance matrix.<br />
% Second data set X2<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300); <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
plot(X1(:,1),X1(:,2),'.b'); hold on;<br />
plot(X2(:,1),X2(:,2),'ob')<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
 % We compute the principal components:<br />
 % Combine data sets to map both into the same subspace<br />
 X=[X1;X2];<br />
 X=X'; % X is now 2 x 600, one data point per column<br />
 % Built-in PCA function; princomp expects one observation per row, hence X'<br />
 [coefs, scores]=princomp(X');<br />
 <br />
 % First principal component (drawn again in red, scaled by 10 for visibility)<br />
 plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
 plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
 sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
 w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
 plot ([0 w(1)], [0 w(2)],'g') % Fisher direction (green)<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
 % We now make the projection:<br />
 Xf=w'*X<br />
 figure<br />
 plot(Xf(1:300),1,'ob') % The projected data are one-dimensional, so each point is drawn at height 1<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 % In the picture above there is very little overlap between the two classes.<br />
 % For comparison, project onto the first principal component:<br />
 Xp=coefs(:,1)'*X<br />
 figure<br />
 plot(Xp(1:300),1,'ob')<br />
 hold on<br />
 plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 % In this case the two classes overlap, since we project onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
===Some of FDA applications===<br />
FDA has applications in many domains; some of them are listed below:<br />
<br />
* Speech/music/noise classification in hearing aids<br />
FDA can be used to enhance listening comprehension when the user moves from one sound<br />
environment to another. For more information, see the paper by Alexandre et al. [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here].<br />
<br />
* Application to face recognition<br />
FDA can be used for face recognition in different situations. Kong et al. propose a 2D FDA framework for face<br />
recognition with a small number of training samples [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].<br />
<br />
* Palmprint recognition<br />
FDA is used in biometrics to implement an automated palmprint recognition system. See "An automated palmprint recognition system" by Connie et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here].<br />
<br />
{{Cleanup|date=October 2010|reason=I think briefing about the other applications would be easier than browsing through all of these applications}}<br />
<br />
{{Cleanup|date=October 2010|reason= This link is no longer valid.}}<br />
<br />
Other applications can be found in references 4 to 8, and more [http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=1489148820&_sort=r&_st=13&view=c&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=f210273546a659c90ae0962fce7b8b4e&searchtype=a here].<br />
<br />
=== '''References'''===<br />
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 1083-1088, 20-25 June 2005.<br />
doi: 10.1109/CVPR.2005.30<br />
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]<br />
<br />
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "Speech/music/noise classification in hearing aids using a two-layer classification system with MSE linear discriminants", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]<br />
<br />
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]<br />
<br />
4. Guimet, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.<br />
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]<br />
<br />
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"<br />
Journal of Computers & Chemical Engineering, 2004<br />
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]<br />
<br />
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004<br />
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]<br />
<br />
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]<br />
<br />
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal of Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]<br />
<br />
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010==<br />
<br />
====Obtaining Covariance Matrices====<br />
<br />
<br />
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{i}<br />
\end{align}<br />
</math><br />
<br />
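where <math>\mathbf{S}_{i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>\,i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math> is its mean. No factor <math>\frac{1}{n_{i}}</math> is applied, so that <math>\mathbf{S}_{W}</math> matches the first term of the decomposition derived below.<br />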
<br />
However, the between-class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total covariance <math>\,\mathbf{S}_{T}</math> of a given set of data is constant and known, and we can also decompose this variance into two parts: the within-class variance <math>\mathbf{S}_{W}</math> and the between-class variance <math>\mathbf{S}_{B}</math> in a way that is similar to [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA]. We thus have:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
where the total variance is given by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = <br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
We can now get <math>\mathbf{S}_{B}</math> from the relationship: <br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
<br />
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Suppose the data contains <math>\, k </math> classes, and each class <math>\, j </math> contains <math>\, n_{j} </math> data points. We denote the overall mean vector by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
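Writing each deviation as <math>(\mathbf{x}_{j}-\mathbf{\mu}_{i}) + (\mathbf{\mu}_{i}-\mathbf{\mu})</math>, and noting that the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = 0</math> for every class <math>\,i</math>, we obtain<br />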
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within-class covariance <math>\mathbf{S}_{W}</math><br />
and the between-class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term in the final line of the derivation above as the between-class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
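<br />
The decomposition can be checked numerically. The following Python sketch builds the three scatter matrices for a small labelled data set and compares <math>\mathbf{S}_{T}</math> with <math>\mathbf{S}_{W} + \mathbf{S}_{B}</math>; the data values and helper names are made up for illustration, and the matrices are taken unnormalized, as in the derivation above.<br />

```python
# Sketch: check S_T = S_W + S_B on a toy labelled data set (values are made up).

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(c, a):
    """Multiply every entry of matrix a by the scalar c."""
    return [[c * x for x in row] for row in a]

def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def scatter(points, centre):
    """Sum of (x - centre)(x - centre)^T over the given points."""
    dim = len(centre)
    total = [[0.0] * dim for _ in range(dim)]
    for p in points:
        dev = [pi - ci for pi, ci in zip(p, centre)]
        total = add(total, outer(dev, dev))
    return total

classes = {
    1: [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]],
    2: [[6.0, 5.0], [7.0, 8.0]],
    3: [[0.0, 9.0], [1.0, 7.0], [2.0, 8.0], [1.0, 8.0]],
}
all_points = [p for pts in classes.values() for p in pts]
mu = mean(all_points)                       # overall mean

s_t = scatter(all_points, mu)               # total scatter
s_w = [[0.0, 0.0], [0.0, 0.0]]
s_b = [[0.0, 0.0], [0.0, 0.0]]
for pts in classes.values():
    mu_i = mean(pts)
    s_w = add(s_w, scatter(pts, mu_i))      # within-class part
    dev = [a - b for a, b in zip(mu_i, mu)]
    s_b = add(s_b, scale(len(pts), outer(dev, dev)))   # n_i (mu_i - mu)(mu_i - mu)^T

s_sum = add(s_w, s_b)
max_gap = max(abs(s_t[i][j] - s_sum[i][j]) for i in range(2) for j in range(2))
print(max_gap)   # should be zero up to rounding error
```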
<br />
<br />
Recall that in the two-class problem we had<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^* =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})^{T})<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
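The second expression is the multi-class definition evaluated for <math>\,k=2</math>. The two differ only by a constant factor: since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, so that <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B}^*</math>. A constant factor does not change the maximizing direction.<br />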
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\quad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
The within-class scatter of the projected data is therefore<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))((\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W})<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the following as our measure:<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of<br />
<br />
{{Cleanup|date=What if we encounter complex eigenvalues? Then the concept of being large does not make sense. What is the solution in that case? }}<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
<br />
Recall that the squared Frobenius norm of <math>\mathbf{X}</math> is <br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following classic criterion function that Fisher used<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
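Similar to the two-class problem, the ratio is unchanged when <math>\mathbf{W}</math> is rescaled, so we fix the scale of the projected within-class scatter. So that each column of <math>\mathbf{W}</math> can have its own multiplier, the constraint is imposed on the columns jointly, <math>\mathbf{w}_{i}^{T}\mathbf{S}_{W}\mathbf{w}_{j}=\delta_{ij}</math>:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}=\mathbf{I}</math><br />
<br />
To solve this optimization problem a matrix of Lagrange multipliers <math>\Lambda</math>, which can be taken to be a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - \mathbf{I} \right)\right]<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices. Setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \mathbf{S}_{W}\mathbf{W}\Lambda=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \mathbf{S}_{W}\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>. Column by column this reads <math>\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{S}_{W}\mathbf{w}_{i}</math>, and at such a solution <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] = Tr[\Lambda] = \sum_{i}\lambda_{i}</math>, so the objective is largest when the largest eigenvalues are chosen.<br />
<br />
As a matter of fact, <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>: the <math>\,k</math> vectors <math>\mathbf{\mu}_{i}-\mathbf{\mu}</math> satisfy <math>\sum_{i}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})=0</math>. These eigenvalues are real and non-negative, since <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> is similar to the symmetric positive semi-definite matrix <math>\mathbf{S}_{W}^{-1/2}\mathbf{S}_{B}\mathbf{S}_{W}^{-1/2}</math>.<br />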
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
{{Cleanup|date=October 2010|reason=Adding more general comments about the advantages and flaws of FDA would be effective here.}}<br />
<br />
{{Cleanup|date=October 2010|reason=Please show how the original data could be reconstructed from data whose dimensionality has been reduced by FDA.}}<br />
{{Cleanup|date=October 2010|reason= When you reduce the dimensionality of data, in the most general case you lose some features of the data and cannot reconstruct the data from the reduced space unless the data have special features that help in reconstruction, such as sparsity. In FDA it seems that we cannot reconstruct the data in general from the reduced version of the data }}<br />
<br />
===Generalization of Fisher's Linear Discriminant Analysis ===<br />
<br />
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity<br />
and the fact that it does not require strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be seriously degraded by only a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Therefore, there is a need for robust procedures that can accommodate the outliers and are not strongly affected by them. To this end, a generalization of Fisher's linear discriminant algorithm [http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf] has been developed that leads to a very robust procedure.<br />
<br />
Also notice that LDA can be seen as a dimensionality reduction technique. In a general k-class problem, the k class means lie in an affine subspace of dimension at most k-1. Given a data point, we are looking for the closest class mean to this point. In LDA, we project the data point onto this subspace and calculate distances within it. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimensionality from d dimensions to k - 1 dimensions.<br />
<br />
==Linear and Logistic Regression - October 12, 2010==<br />
<br />
===Linear Regression===<br />
Linear regression is an approach for modeling a scalar response <math>\, y</math> from a set of explanatory (independent) variables <math>\,X</math>. The goal is to choose an appropriate set of explanatory variables for <math>\, y</math> and to estimate its value from them. In classification, by contrast, the goal is to assign data to groups such that members of the same group are more similar to each other than to members of other groups.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
According to Bayes Classification we estimate the posterior as,<br/><br />
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{l}f_{l}(x)\pi_{l}}</math><br/><br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The simple linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
y_i = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
and we can denote it as<br />
:<math><br />
\begin{align}<br />
\mathbf{y} = \beta^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
where <math>\,\beta^{T} = (<br />
\beta_1,..., \beta_{d},\beta_0)</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=<br />
\begin{pmatrix}<br />
\mathbf{x}_{1}, \dots,\mathbf{x}_{n}\\<br />
1, \dots, 1<br />
\end{pmatrix}<br />
</math> is a <math>(d+1) \times n</math> matrix, here <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math> our goal is to find <math>\,\beta^{T}</math> such that the linear model fits the data while minimizing sum of squared errors using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin \mathbf{x}_{i}</math>,<br />
or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
We then try to minimize the residual sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\beta^{T}\mathbf{X})(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}(\mathbf{y}^{T}-\mathbf{X}^{T}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}^{T}<br />
\end{align}<br />
</math><br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \hat\beta^{T}\mathbf{X} = <br />
\mathbf{y}\mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X} </math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].<br />
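<br />
The closed-form solution can be illustrated with a minimal Python sketch for <math>\,d=1</math>, following the layout used above (<math>\mathbf{X}</math> is <math>(d+1) \times n</math> with a final row of ones and <math>\mathbf{y}</math> is <math>1 \times n</math>). The data values are made up, and the <math>2 \times 2</math> system is solved by Cramer's rule purely to keep the example self-contained.<br />

```python
# Sketch: least squares with d = 1, following the layout in the text
# (X is (d+1) x n with a row of ones; y is 1 x n). Data values are made up.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                    # generated exactly from y = 2x + 1
X = [xs, [1.0] * len(xs)]                    # rows: inputs, then ones

# Normal equations (X X^T) beta = X y^T.
A = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(2)] for i in range(2)]
b = [sum(a * y for a, y in zip(X[i], ys)) for i in range(2)]

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
beta = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,    # slope beta_1
        (A[0][0] * b[1] - A[1][0] * b[0]) / det]    # intercept beta_0

y_hat = [beta[0] * x + beta[1] for x in xs]          # fitted values beta^T X
print(beta, y_hat)
```

Because the toy data lie exactly on a line, the fitted values reproduce <math>\mathbf{y}</math>; with noisy data they would be the projection of <math>\mathbf{y}</math> given by the hat matrix.<br />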
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{l}f_{l}(x)\pi_{l}}</math><br/><br />
To make sense as a probability, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If it is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then nothing restricts <math>\displaystyle r(x)</math> to values between 0 and 1. Regression is nevertheless a more direct approach to classification, since it does not need estimates of <math>\ f_k(x) </math> and <math>\ \pi_k </math>: for a label coded as 0 or 1,<br />
<math>\ E(Y|X=x) = 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x)=P(Y=1|X=x) </math>. <br />
Because the fitted values are not confined to the interval between 0 and 1, this model is not a good estimate of the posterior, but at times it can still lead to a decent classifier, for example with the label coding <math>\ y_i=\frac{1}{n_1} </math> for points in the first class and <math>\ y_i=\frac{-1}{n_2} </math> for points in the second.<br />
<br />
===Logistic Regression===<br />
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood <math>\displaystyle Pr(Y|X)</math>. Since <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications with two classes, for instance: people survive or die, have a disease or not, have a risk factor or not.<br />
<br />
==== logistic function ====<br />
[[File:200px-Logistic-curve.svg.png | Logistic Sigmoid Function]]<br />
<br />
<br />
<br />
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common sigmoid curve. <br />
<br />
1. <math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
3. <math>y(0) = \frac{1}{2}</math><br />
<br />
4. <math> \int y\, dx = \ln(1 + e^{x}) + C</math><br />
<br />
5. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
The logistic curve shows early exponential growth for negative x, which slows to linear growth of slope 1/4 near x = 0, then approaches y = 1 with an exponentially decaying gap.<br />
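<br />
These properties are easy to check numerically. The Python sketch below compares a central-difference estimate of the slope with <math>y(1-y)</math> and evaluates the degree-5 Taylor polynomial about 0; the function names and the step size <code>h</code> are illustrative choices.<br />

```python
import math

def logistic(x):
    """y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_series(x):
    """Taylor polynomial of the logistic function about 0, up to the x^5 term."""
    return 0.5 + x / 4 - x ** 3 / 48 + x ** 5 / 480

h = 1e-6   # step for the central difference
for x in (-2.0, 0.0, 0.5):
    y = logistic(x)
    numeric_slope = (logistic(x + h) - logistic(x - h)) / (2 * h)
    # The last two printed values should agree: dy/dx = y(1 - y).
    print(x, y, numeric_slope, y * (1 - y))

# Close to 0 the polynomial is a good approximation of the curve.
series_gap = abs(logistic(0.5) - logistic_series(0.5))
print(series_gap)
```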
<br />
====Intuition behind Logistic Regression====<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
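<br />
The calculation is short. Writing <math>\,p = P(Y=1|X=x)</math> as shorthand (a symbol used only here), so that <math>\,P(Y=0|X=x) = 1-p</math>, the log odds model gives:<br />
:<math>
\begin{align}
\log\left(\frac{p}{1-p}\right) &= \beta^Tx \\
\frac{p}{1-p} &= \exp(\beta^Tx) \\
p &= (1-p)\exp(\beta^Tx) \\
p\left(1+\exp(\beta^Tx)\right) &= \exp(\beta^Tx) \\
p &= \frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}
\end{align}
</math>
Since <math>\,\exp(\beta^Tx) > 0</math>, the result always lies strictly between 0 and 1, and <math>\,1-p = \frac{1}{1+\exp(\beta^Tx)}</math>.<br />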
<br />
====The Logistic Regression Model====<br />
<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
{{Cleanup|date=October 18 2010|reason=I Could not find any source for these graphs. However, they following the definition of the defined probability. I don't think the generated graph as it is here is copyrighted, but if you worried you can draw this figure by applying the function and post the result.}}<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
====Fitting a Logistic Regression====<br />
Logistic regression tries to fit a distribution. The common practice in statistics is to fit a density function, here the posterior probability of each class <math>\displaystyle Pr(Y|X)</math>, to data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence and identical distribution)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
{{Cleanup|date=October 13 2010|reason=I think, in the following, y_i * x_i and the single x_i on the right side should both be transposed by matrix calculus?}}<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math> <br />
<br />
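Setting this derivative to zero gives <math>\ d+1 </math> nonlinear equations in <math> \underline{\beta} </math>. Since the first component of each <math>\underline{x}_i</math> is 1 (the intercept term), the first of these equations is <math>\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta}) </math>, i.e. the expected number of class ones matches the observed number.<br />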
<br />
To solve these equations, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used, which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
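As a rough numerical check of the log-likelihood and gradient formulas above, the following plain-MATLAB sketch evaluates both on a small made-up data set and compares the analytic gradient with central finite differences. The data, the evaluation point <code>beta</code>, and all variable names are illustrative assumptions; X is stored as a <math>(d+1)\times n</math> matrix whose first row is all ones.<br />

```matlab
% Toy data (hypothetical): columns of X are observations, first row is the intercept.
X = [ 1  1  1  1  1  1;      % (d+1) x n with d = 1
     -2 -1  0  1  2  3];
y = [0; 0; 1; 0; 1; 1];      % n x 1 labels in {0,1}
beta = [0.2; 0.5];           % arbitrary point at which to evaluate

% Log-likelihood: sum_i [ y_i*beta'*x_i - log(1 + exp(beta'*x_i)) ]
loglik = @(b) sum(y .* (X' * b) - log(1 + exp(X' * b)));

% Analytic gradient: sum_i (y_i - p_i) x_i, written as X*(y - p)
p = 1 ./ (1 + exp(-(X' * beta)));
grad = X * (y - p);

% Central finite differences as an independent check of the gradient
h = 1e-6;
grad_fd = zeros(size(beta));
for j = 1:numel(beta)
    e = zeros(size(beta));
    e(j) = h;
    grad_fd(j) = (loglik(beta + e) - loglik(beta - e)) / (2*h);
end
max_abs_diff = max(abs(grad - grad_fd));
```

If the formulas are implemented consistently, <code>max_abs_diff</code> should be very small (limited only by finite-difference error).<br />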
<br />
====Extension====<br />
<br />
* When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model].<br />
<br />
* Limitations of Logistic Regression:<br />
:1. No assumptions are made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another, because this can cause problems with estimation.<br />
:2. A large number of data points (i.e. a large sample size) is required for logistic regression to have sufficient numbers in both classes. The more features/dimensions the data has, the larger the sample size required.<br />
<br />
==Lecture summary==<br />
{{Cleanup|date=October 18 2010|reason=Can anybody provide a better lecture summary? The one below is to just get it started}}<br />
In this lecture, logistic regression was introduced and the posterior probability functions for the two-class problem were defined. Maximum likelihood was used to estimate the model parameters (i.e. to fit the logistic posterior to the data).<br />
<br />
== Logistic Regression Cont. - October 14, 2010 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Estimating Parameters <math>\underline{\beta}</math> ===<br />
<br />
'''Criterion''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
'''Newton-Raphson Algorithm:'''<br /><br />
<br />
If we want to find <math>\ x^* </math> such that <math>\ f(x^*)=0</math><br />
<br />
We first pick a starting point <math>x^* = x^{old}</math> and we solve:<br />
<br /><br />
<br />
<math>\ x^{*} \leftarrow x^{old}-\frac {f(x^{old})}{\partial f(x^{old})} </math> <br /><br />
<math> \ x^{old} \leftarrow x^{*}</math> <br />
<br /><br />
This is repeated until convergence. <br />
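The root-finding iteration above can be sketched in a few lines of MATLAB. The function <math>f(x)=x^2-2</math>, the starting point, and the stopping tolerance are arbitrary choices made for illustration only.<br />

```matlab
% Newton-Raphson for f(x) = x^2 - 2, whose positive root is sqrt(2).
f  = @(x) x.^2 - 2;              % illustrative function
df = @(x) 2*x;                   % its first derivative
x_old = 1;                       % starting point
for iter = 1:50
    x_new = x_old - f(x_old) / df(x_old);   % Newton-Raphson update
    if abs(x_new - x_old) < 1e-12
        break;                   % successive iterates agree: treat as converged
    end
    x_old = x_new;
end
```

After the loop, <code>x_new</code> holds the approximate root; the iteration cap of 50 is only a safeguard against non-convergence.<br />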
<br />
If we want to maximize or minimize <math>\ f(x) </math>, then solve for <math>\ \partial f(x)=0 </math><br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {\partial f(x^{old})}{\partial^2 f(x^{old})} </math><br />
<br />
<br /><br />
<br />
In vector notation the above can be written as <br /><br />
<br />
<math><br />
X^{new} \leftarrow X^{old} - H^{-1}\Delta<br />
</math><br />
<br /><br />
H is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix] or second derivative matrix and <math>\Delta</math> is the gradient both evaluated at <math>X^{old}</math> <br />
<br /><br />
<br />
'''note:''' If the Hessian is not invertible the [http://en.wikipedia.org/wiki/Generalized_inverse generalized inverse] or pseudo inverse can be used<br />
<br /><br />
<br /><br />
<br />
<br />
As shown above, the [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second derivative, or Hessian.<br />
<br />
<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website that includes a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> to one with the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math><br />
<br />
and then evaluating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{(d+1)}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i,i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^{T}\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
<math>\ Z </math> is called the adjusted response. The update is solved repeatedly because <math>\ P </math>, <math>\ W </math>, and <math>\ Z </math> all change whenever <math>\underline{\beta}</math> changes. This algorithm is called [http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares iteratively reweighted least squares] because it solves a weighted least squares problem at each iteration.<br />
<br />
Recall that linear regression by least squares solves the following minimization: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
which gives <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
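<br />
As a brief check of this claim (a sketch that treats <math>\,W</math> and <math>\,Z</math> as fixed within one iteration, writes the weighted criterion with <math>\,X</math> as the <math>(d+1)\times n</math> input matrix, and assumes <math>\,XWX^T</math> is invertible), setting the gradient of the weighted criterion to zero recovers the update <math>(XWX^T)^{-1}XWZ</math>:<br />
<br />
:<math>
\begin{align}
\frac{\partial}{\partial \underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta}) &= -2XW(Z-X^T\underline{\beta}) = 0\\
\Rightarrow \quad XWX^T\underline{\beta} &= XWZ\\
\Rightarrow \quad \underline{\beta} &= (XWX^T)^{-1}XWZ
\end{align}
</math><br />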
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
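<br />
The steps above can be sketched in plain MATLAB as follows. This is a minimal illustration rather than a production implementation: the toy data set is made up (and deliberately not linearly separable, so that a finite maximum likelihood estimate exists), the tolerance and iteration cap are arbitrary, and X follows the <math>(d+1)\times n</math> convention with a first row of ones.<br />

```matlab
% Toy, non-separable data (hypothetical). X is (d+1) x n with a row of ones.
X = [ones(1,8);
     -3 -2 -1 0 0.5 1 2 3];
Y = [0; 0; 1; 0; 1; 0; 1; 1];               % step 2: labels
beta = zeros(size(X,1), 1);                 % step 1: start at zero
for iter = 1:100
    P = 1 ./ (1 + exp(-(X' * beta)));       % step 3: n x 1 probabilities
    w = P .* (1 - P);                       % step 4: diagonal entries of W
    W = diag(w);
    Z = X' * beta + (Y - P) ./ w;           % step 5: adjusted response
    beta_new = (X * W * X') \ (X * W * Z);  % step 6: weighted least squares
    done = norm(beta_new - beta) < 1e-10;   % step 7: convergence check
    beta = beta_new;
    if done
        break;
    end
end
% Gradient of the log-likelihood at the estimate; it should be near zero.
score = X * (Y - 1 ./ (1 + exp(-(X' * beta))));
```

The backslash operator solves the linear system directly instead of forming <math>(XWX^T)^{-1}</math> explicitly, which is usually the numerically safer choice.<br />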
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1 over the classes. A closed-form solution exists for least squares.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to lie between 0 and 1 and to sum to 1 over the classes. No closed-form solution exists.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math> (<math>\,d+1</math> if the intercept is counted). The number of parameters grows linearly with the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically with the dimension.<br />
#LDA estimates its parameters more efficiently by using more information about the data; samples without class labels can also be used in LDA. <br />
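<br />
To get a feel for the two growth rates in the list above, the following short MATLAB sketch tabulates both parameter counts for a few dimensions. The dimensions chosen are arbitrary, and the logistic count uses <math>\,d</math> as in the list (add one if the intercept is included).<br />

```matlab
d = [2 10 50 100];                  % example dimensions (arbitrary)
n_logistic = d;                     % count used in the list; d+1 with intercept
n_lda = 2*d + d.*(d+1)/2 + 2;       % two means, shared covariance, two priors
table_counts = [d; n_logistic; n_lda]';   % one row per dimension
disp(table_counts);
```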
<br />
{{Cleanup|date=October 2010|reason= Could somebody please validate the following points}} <br />
{{Cleanup|date=October 2010|reason= I'm not too sure about the first point either, but it seems reasonable to me. Would be great if someone can confirm this point. Thanks}} <br />
<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust. (For high dimensionality logistic regression is more accommodating)<br />
#In practice, logistic regression and LDA often give similar results.<br />
#Logistic regression is more robust, because it does not assume normal distribution regarding each independent variable.<br />
<br />
Many other advantages of logistic regression are explained [http://www.statgun.com/tutorials/logistic-regression.html here].<br />
<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data with logistic regression. This function returns B, a <math>\,(d+1)</math><math>\,\times</math><math>\,(k-1)</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
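<br />
As a small illustration of this rule, the sketch below plugs the fitted coefficients into the posterior formulas for one point. The point <code>x_new</code> is hypothetical (it is not taken from the 2_3 data); only the coefficient values come from the output above.<br />

```matlab
B = [0.1861; -5.5917; -3.0547];     % coefficients reported by mnrfit above
x_new = [0.1; 0.2];                 % hypothetical point in the 2-D PCA space
eta = B(1) + B(2:3)' * x_new;       % linear predictor
p1 = exp(eta) / (1 + exp(eta));     % P(Y=1 | X=x)
p2 = 1 / (1 + exp(eta));            % P(Y=2 | X=x)
if eta >= 0
    y_hat = 1;                      % posterior of class 1 is at least 0.5
else
    y_hat = 2;
end
```

For this point the linear predictor is negative, so the rule assigns class 2.<br />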
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression. The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
===Lecture Summary===<br />
<br />
Traditionally, logistic regression parameters are estimated using maximum likelihood. However, other optimization techniques may be used as well.<br />
<br /><br />
Since there is no closed-form solution for the zero of the first derivative of the log-likelihood, the Newton-Raphson algorithm is used. Because the log-likelihood is concave (equivalently, the negative log-likelihood is convex), any stationary point that Newton's method converges to is a global optimum.<br />
<br /><br />
Logistic regression requires fewer parameters than LDA or QDA and is therefore more favorable for high-dimensional data.<br />
<br />
===Supplements===<br />
<br />
A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture.<br />
<br />
== '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' ==<br />
<br />
=== Lecture Summary ===<br />
<br />
In this lecture, the topic of logistic regression was finalized by covering multi-class logistic regression, and a new topic, the perceptron, was introduced. The perceptron is a linear classifier for two-class problems. Its main goal is to classify data into 2 classes by minimizing the distances between the misclassified points and the decision boundary. This will be continued in the following lectures.<br />
<br />
=== Multi-Class Logistic Regression ===<br />
Recall that in two-class logistic regression, the posterior probability of one of the classes (say class 1) is modeled by a function in the form shown in figure 1. <br />
<br />
The posterior probability of the second class (say class 0) is the complement of the first class (class 1). <br /><br /><br />
<math>\displaystyle P(Y=0 | X=x) = 1 - P(Y=1 | X=x)</math><br /><br />
<br />
This function is called the logistic sigmoid function, which is the reason why this algorithm is called "logistic regression".<br />
[[File:Picture1.png|150px|thumb|right|<math>Fig.1: P(Y=1 | X=x)</math>]]<br />
<br />
<math>\displaystyle \sigma\,\!(a) = \frac {e^a}{1+e^a} = \frac {1}{1+e^{-a}}</math><br /><br /><br />
<br />
In two-class logistic regression, we compare the posterior of one class to the other one using this ratio:<br /><br />
<br />
:<math> \frac{P(Y=1|X=x)}{P(Y=0|X=x)}</math><br /><br />
<br />
If we look at the natural logarithm of this ratio, we find that it is always a linear function in <math>x</math>:<br /><br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\underline{\beta}^T\underline{x} \quad \rightarrow (*)</math> <br /><br /><br />
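<br />
To see why (*) holds, substitute the two posteriors of the two-class model; the common denominator <math>1+e^{\underline{\beta}^T\underline{x}}</math> cancels:<br />
<br />
:<math>\log\left(\frac{e^{\underline{\beta}^T\underline{x}}/(1+e^{\underline{\beta}^T\underline{x}})}{1/(1+e^{\underline{\beta}^T\underline{x}})}\right)=\log\left(e^{\underline{\beta}^T\underline{x}}\right)=\underline{\beta}^T\underline{x}</math><br />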
<br />
What if we have more than two classes?<br /><br />
<br />
Using (*), we can extend the notion of logistic regression for the cases where we have more than two classes.<br /><br />
<br />
Assume we have <math>k</math> classes. Looking at the logarithm of the ratio of posteriors of each class and the k<sup>th</sup> class, we have: <br /><br />
<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_1}^T\underline{x} </math> <br /><br />
:<math>\log\left(\frac{P(Y=2|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_2}^T\underline{x} </math> <br /><br />
::::<math> \vdots</math><br /><br />
:<math>\log\left(\frac{P(Y=k-1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_{k-1}}^T\underline{x} </math> <br /><br />
<br />
<br />
Although in the above posterior ratios, the denominator is chosen to be the posterior of the last class (class k), the choice of denominator is arbitrary in that the posterior estimates are equivariant under this choice - [http://www.springerlink.com/content/t45k620382733r71/ Linear Methods for Classification].<br /><br /><br />
<br />
Each of these functions is linear in <math>x</math>; however, each has a different <math>\underline{\,\beta_{i}}</math>. We have to make sure that the posterior probabilities assigned to the different classes sum to one.<br /><br /><br />
<br />
In general, we can write:<br />
<br /><math>P(Y=c | X=x) = \frac{e^{\underline{\beta_c}^T \underline{x}}}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}},\quad c \in \{1,\dots,k-1\} </math><br /><br />
<br /><math>P(Y=k | X=x) = \frac{1}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}}</math><br /><br />
These posteriors clearly sum to one. <br /><br /><br />
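<br />
The formulas above can be evaluated directly, as in the following MATLAB sketch. The coefficient matrix <code>B</code> and the input <code>x</code> are made-up values for <math>k=3</math> classes and <math>d=2</math> features; column <math>c</math> of <code>B</code> holds <math>\underline{\beta_c}</math> and class <math>k</math> is the reference class.<br />

```matlab
% Hypothetical coefficients: B is (d+1) x (k-1), here 3 x 2.
B = [ 0.5 -1.0;
      1.0  0.3;
     -0.7  0.8];
x = [1; 0.4; -1.2];                 % input with a leading 1 for the intercept
scores = exp(B' * x);               % exp(beta_c' * x) for c = 1..k-1
denom = 1 + sum(scores);            % shared denominator
post = [scores / denom; 1 / denom]; % k x 1 vector of P(Y=c | X=x)
[p_max, y_hat] = max(post);         % predicted class: largest posterior
```

The last entry of <code>post</code> is the posterior of the reference class <math>k</math>, and the entries sum to one by construction.<br />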
<br />
In the two-class problem, it is fairly simple to find the <math>\beta</math> parameter (the <math>\beta</math> in two-class logistic regression has dimension <math>(d+1)\times1</math>), as mentioned in previous lectures. In the multi-class case the iterative Newton method can be used, but here <math>\beta</math> is of size <math>(d+1)\times(k-1)</math> and the weight matrix W is a dense, non-diagonal matrix. This results in an algorithm that is computationally inefficient but still feasible. A trick would be to re-parametrize the logistic regression problem by expanding the input vector <math>x</math> (Question.4 in assignment no.2).<br />
<br /><br /><br />
<br />
===Neural Network Concept===<br />
The concept of constructing an artificial neural network comes from scientists who wanted to simulate the human neural network on their computers. They were trying to create computer programs that can learn like people. A neural network is a method in artificial intelligence that is a simplified model of neural processing in the brain, even though the relation between this model and the biological architecture of the brain is not yet clear.<br />
<br />
=== Perceptron ===<br />
<br />
The [http://en.wikipedia.org/wiki/Perceptron Perceptron] was invented in 1957 by [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt]. It is the basic building block of feedforward neural networks.<br /><br /><br />
<br />
We know that the least squares solution obtained by regressing a -1/1 response variable <math>\displaystyle Y</math> on the observations <math>\displaystyle x</math> leads to the same coefficients as LDA. Recall that LDA minimizes the distance between the discriminant function (decision boundary) and the data points. Least squares returns the sign of the linear combination of features as the class label (figure 2). This was called the perceptron in the engineering literature during the 1950s. <br /><br /><br />
<br />
[[File:Perceptron.jpg|371px|thumb|right| Fig.2 Diagram of a linear perceptron ]]<br />
<br />
There is a cost function <math>\displaystyle D</math> that perceptron tries to minimize:<br /><br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math><br /><br />
<br />
where <math>\displaystyle M</math> is a set of misclassified points. <br /><br />
<br />
This is basically minimizing the sum of distances between the misclassified points and the decision boundary.<br /><br /><br />
<br />
'''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br /><br />
<br />
Consider points <math>\underline{x_1}</math>, <math>\underline{x_2}</math> and a decision boundary defined as <math>\underline{\beta}^T\underline{x}+\beta_0=0</math>, as shown in figure 3.<br /><br />
<br />
[[File:DB.jpg|248px|thumb|right| Fig.3 Distance from the decision boundary ]]<br />
<br />
Both <math>\underline{x_1}</math> and <math>\underline{x_2}</math> lie on the decision boundary, then we have:<br /><br />
<math>\underline{\beta}^T\underline{x_1}+\beta_0=0 \rightarrow (1)</math><br /><br />
<math>\underline{\beta}^T\underline{x_2}+\beta_0=0 \rightarrow (2)</math><br /><br />
<br />
From (1) and (2):<br /><br />
<math>\underline{\beta}^T(\underline{x_2}-\underline{x_1})=0</math><br /><br />
<br />
Therefore, <math>\displaystyle \underline{\beta}</math> is orthogonal to <math>\underline{x_2}-\underline{x_1}</math> which is in the same direction with the decision boundary, which means that <math>\displaystyle \underline{\beta}</math> is orthogonal to the decision boundary. <br /><br />
<br />
Then the signed distance of a point <math>\underline{x_0}</math> from the decision boundary is, up to the scale factor <math>1/\|\underline{\beta}\|</math>: <br /><br />
<br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})</math><br /><br />
<br />
From (2): <br /><br />
<br />
<math>\underline{\beta}^T\underline{x_2}= -\beta_0</math>. <br /><br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})=\underline{\beta}^T\underline{x_0}-\underline{\beta}^T\underline{x_2}=\underline{\beta}^T\underline{x_0}+\beta_0</math><br /><br />
<br />
Therefore, the signed distance from any point <math>\underline{x_{i}}</math> to the discriminant hyperplane is proportional to <math>\underline{\beta}^T\underline{x_{i}}+\beta_0</math>.<br /><br /><br />
<br />
However, this quantity is not always positive. Considering <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>, if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive, since <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are either both positive or both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other one is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> will be negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> makes this cost function non-negative. <br /><br /><br />
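To make the cost function concrete, the following MATLAB sketch computes the set <math>\displaystyle M</math> and the cost <math>\displaystyle D</math> for a small made-up data set and an arbitrary candidate hyperplane. All numbers and variable names are illustrative assumptions; here the rows of <code>X</code> are observations and the labels are in {-1, +1}.<br />

```matlab
% Toy data (hypothetical): rows of X are observations, labels in {-1,+1}.
X = [2 1; 1 3; -1 -2; -2 0.5; 0.5 -1];
y = [1; 1; -1; -1; 1];
beta = [1; 1];                      % an arbitrary candidate direction
beta0 = 0;                          % and offset
margin = y .* (X * beta + beta0);   % y_i * (beta' * x_i + beta0)
M = find(margin < 0);               % indices of misclassified points
D = -sum(margin(M));                % perceptron cost, non-negative by construction
```

With these values only the fifth point lies on the wrong side of the hyperplane, so <code>M</code> contains the single index 5 and <code>D</code> equals 0.5.<br />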
<br />
==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 ==<br />
===Lecture Summary===<br />
In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on gradient descent. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network (FFNN). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation. <br />
<br />
To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.<br />
<br />
===Perceptron===<br />
The last lecture introduced the Perceptron and showed how it can suggest a solution for the 2-class classification problem. We saw that the solution requires minimization of a cost function, which is basically a summation of the distances of the misclassified data points to the separating hyperplane. This cost function is<br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x}_i+\beta_0),</math><br />
<br />
in which, <math>\,M</math> is the set of misclassified points. Thus, the objective is to find <math>\arg\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>.<br />
<br />
====Perceptron Learning Algorithm====<br />
To minimize <math>D(\underline{\beta},\beta_0)</math>, an algorithm that uses gradient-descent has been suggested. Gradient descent, also known as steepest descent, is a numerical optimization technique that starts from an initial value for <math>(\underline{\beta},\beta_0)</math> and recursively approaches an optimal solution. Each step of recursion updates <math>(\underline{\beta},\beta_0)</math> by subtracting from it a factor of the gradient of <math>D(\underline{\beta},\beta_0)</math>. Mathematically, this gradient is<br />
<br />
<math>\nabla D(\underline{\beta},\beta_0)<br />
= \left( \begin{array}{c}\cfrac{\partial D}{\partial \underline{\beta}} \\ \\ <br />
\cfrac{\partial D}{\partial \beta_0} \end{array} \right)<br />
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}\underline{x}_i \\ <br />
-\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math><br />
<br />
However, the perceptron learning algorithm does not use the sum of the contributions from each observation to calculate the gradient for each step. Instead, each step uses the gradient contribution from only a single observation, and each successive step uses a different observation. This slight modification is called stochastic gradient descent. That is, instead of subtracting some factor of <math>\nabla D(\underline{\beta},\beta_0)</math> at each step, we subtract a factor of the contribution of a single misclassified observation <math>\,i</math> (equivalently, we add a factor of <math>y_{i}\underline{x}_i</math> and <math>y_{i}</math>):<br />
<br />
<math>-\left( \begin{array}{c} y_{i}\underline{x}_i \\ <br />
y_{i} \end{array} \right)</math><br />
<br />
As a result, the pseudo code for the Perceptron Learning Algorithm is as follows:<br />
<br />
:1) Choose a random initial value for <math>(\underline{\beta},\beta_0)</math>.<br />
<br />
:2) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^0\\<br />
\beta_0^0<br />
\end{pmatrix}</math><br />
<br />
:3) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{new}}\\<br />
\beta_0^{\mathrm{new}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix}<br />
y_i \underline{x_i}\\<br />
y_i<br />
\end{pmatrix}</math> for some <math>\,i \in M</math>.<br />
<br />
:4) If the termination criterion has not been met, go back to step 3 and use a different observation datapoint (i.e. a different <math>\,i</math>).<br />
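<br />
The steps above can be sketched in plain MATLAB as follows. This is a minimal illustration under stated assumptions: the toy data are made up and linearly separable, the learning rate is arbitrary, the initial value is zero rather than random (for reproducibility), points lying exactly on the boundary are treated as misclassified, and the first misclassified point is used at each step instead of a randomly chosen one.<br />

```matlab
% Linearly separable toy data (hypothetical); labels in {-1,+1}.
X = [2 2; 3 1; 2.5 3; -1 -1; -2 0; -1.5 -2.5];
y = [1; 1; 1; -1; -1; -1];
rho = 0.5;                                    % learning rate (arbitrary choice)
beta = zeros(2,1);                            % steps 1-2: zero start
beta0 = 0;
for iter = 1:1000
    M = find(y .* (X * beta + beta0) <= 0);   % misclassified or on the boundary
    if isempty(M)
        break;                                % termination: nothing left to fix
    end
    i = M(1);                                 % pick one misclassified observation
    beta  = beta  + rho * y(i) * X(i,:)';     % step 3, coefficient part
    beta0 = beta0 + rho * y(i);               % step 3, offset part
end
n_errors = sum(y .* (X * beta + beta0) <= 0); % remaining misclassified points
```

On separable data such as this, the loop exits once <code>M</code> is empty; the cap of 1000 iterations is only a safeguard for data that are not separable.<br />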
<br />
The learning rate <math>\,\rho</math> controls the step size of convergence toward <math>\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>. A larger value for <math>\,\rho</math> causes the steps to be larger. If <math>\,\rho</math> is set too large, however, the minimum could be missed (over-stepped).<br />
In practice, <math>\rho</math> can be adaptive rather than fixed: it can be larger in the first steps than in the last ones. At the beginning, a larger <math>\rho</math> helps to find an approximate answer sooner, and a smaller <math>\rho</math> in the last steps helps to tune the final answer more accurately. <br />
<br />
<br />
As mentioned earlier, the learning algorithm uses just one of the data points at each iteration; this is the common practice when dealing with online applications. In an online application, datapoints are accessed one-at-a-time because training data is not available in batch form. The learning algorithm does not require the derivative of the cost function with respect to the previously seen points; instead, we just have to take into consideration the effect of each new point.<br />
<br />
One way that the algorithm could terminate is if there are no more mis-classified points (i.e. if the set <math>\,M</math> is empty). As long as there are points in <math>\,M</math>, the algorithm continues until some other termination criterion is reached. The termination criterion for an optimization algorithm is usually convergence, but for numerical methods this is not well-defined. In theory, convergence is realized when the gradient of the cost function is zero; in numerical methods a gradient that is close to zero within some margin of error is accepted instead.<br />
<br />
Since the data is linearly-separable, the solution is theoretically guaranteed to converge in a finite number of iterations. This number of iterations depends on the <br />
<br />
* learning rate <math>\,\rho</math><br />
<br />
* initial value <math>(\underline{\beta},\beta_0)</math><br />
<br />
* difficulty of the problem. The problem is more difficult if the gap between the classes of data is very small.<br />
<br />
Note that we consider the offset term <math>\beta_0</math> separately from the <math>\underline{\beta}</math> to distinguish this formulation from those in which the direction of the hyperplane (<math>\underline{\beta}</math>) has been considered.<br />
<br />
A major concern about gradient descent is that it may get trapped in local optimal solutions.<br />
<br />
====Some notes on the Perceptron Learning Algorithm====<br />
<br />
* If the training data points are available in batch form, it is better to take advantage of a closed-form optimization technique like least-squares or maximum-likelihood estimation for linear classifiers. (These closed-form solutions had been around for many years before the invention of the Perceptron).<br />
<br />
* Just like the linear classifier, a Perceptron can discriminate between only two classes at a time, and one can extend it to multi-class problems by using one of the <math>k-1</math>, <math>k</math>, or <math>k(k-1)/2</math>-hyperplane methods.<br />
<br />
* If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane that classifies the training data with zero error. The convergence is guaranteed if the learning rate is set adequately.<br />
<br />
* If the two classes are not linearly separable, the algorithm will never converge, so a separate termination criterion is needed in these cases (e.g. a maximum number of iterations in which convergence is expected, or a threshold on the rate of change of the cost function and its derivative).<br />
<br />
* In the case of linearly separable classes, the final solution and number of iterations will be dependent on the initial conditions, learning rate, and the gap between the two classes. In general, a smaller gap between classes requires a greater number of iterations for the algorithm to converge.<br />
<br />
* The learning rate (or updating step) has a direct impact on both the number of iterations and the accuracy of the solution to the optimization problem. Smaller values make convergence slower but lead to a more accurate solution. Conversely, larger values make the process faster, at the cost of some precision. So, one should balance this trade-off in order to reach an accurate enough solution fast enough. (exploration vs. exploitation)<br />
<br />
In the upcoming lectures, we introduce Support Vector Machines (SVM), which use an iterative optimization scheme similar to that of the Perceptron but have a different definition for the cost function.<br />
<br />
===Universal Function Approximator===<br />
The universal function approximator is a mathematical formulation for a group of estimation techniques. The usual formulation for it is<br />
<br />
<math>\hat{Y}(x)=\sum\limits_{i=1}^{n}\alpha_i\sigma(\omega_i^Tx+b_i),</math><br />
<br />
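where <math>\hat{Y}(x)</math> is an estimate of a function <math>\,Y(x)</math>. According to the universal approximation theorem, for any <math>\,\epsilon > 0</math> the parameters <math>\,n</math>, <math>\,\alpha_i</math>, <math>\,\omega_i</math> and <math>\,b_i</math> can be chosen so that<br />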
<br />
<math>|\hat{Y}(x) - Y(x)|<\epsilon,</math><br />
<br />
which means that <math>\hat{Y}(x)</math> can get as close to <math>\,Y(x)</math> as necessary.<br />
<br />
This formulation assumes that the output, <math>\,Y(x)</math>, is a linear combination of a set of functions like <math>\,\sigma(.)</math>, where <math>\,\sigma(.)</math> is a nonlinear function of the inputs <math>\,x_i</math>.<br />
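<br />
As a small illustration of the formula, the sketch below evaluates <math>\hat{Y}(x)</math> for given parameters. It assumes that <math>\,\sigma</math> is the logistic function (the formulation itself allows other nonlinear functions), and the names <code>sigma</code> and <code>approximator</code> are hypothetical.<br />

```python
import math


def sigma(a):
    """Logistic function, used here as the assumed nonlinearity."""
    return 1.0 / (1.0 + math.exp(-a))


def approximator(x, alphas, omegas, bs):
    """Y_hat(x) = sum_i alpha_i * sigma(omega_i^T x + b_i)."""
    return sum(alpha * sigma(sum(w * v for w, v in zip(omega, x)) + b)
               for alpha, omega, b in zip(alphas, omegas, bs))


# Two steep units with opposite signs approximate a "bump" that is
# close to 1 on (0.25, 0.75) and close to 0 elsewhere.
alphas = [1.0, -1.0]
omegas = [[50.0], [50.0]]
bs = [-12.5, -37.5]
```

With only <math>\,n=2</math> terms this already approximates an indicator-like function on an interval; adding more terms allows finer shapes.<br />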
<br />
====Generalization Factors====<br />
Even though this formulation represents a universal function approximator, which means that it can be fitted to a set of data as closely as demanded, the closeness of fit must be carefully decided upon. In many cases, the purpose of the model is to target unseen data. However, the fit to this unseen data is impossible to determine before it arrives.<br />
<br />
To overcome this dilemma, a common practice is to divide the available data points into two sets: training data and validation data. We use the training data to estimate the fixed parameters for the model, and then use the validation data to find values for the construction-dependent parameters. How these construction-dependent parameters vary depends on the model. In the case of a polynomial, the construction-dependent parameter would be its highest degree, and for a neural network, the construction-dependent parameter could be the number of hidden layers and the number of neurons in each layer.<br />
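<br />
The training/validation idea can be sketched for the polynomial case. To keep the example short it only compares degree 0 (a constant) with degree 1 (a line), both fitted by least squares; the helper names <code>fit_poly</code>, <code>mse</code> and <code>select_degree</code> are hypothetical, and a real application would consider more candidate degrees.<br />

```python
def fit_poly(xs, ys, degree):
    """Least-squares fit of degree 0 or 1; returns coefficients [c0, c1]."""
    n = len(xs)
    mean_y = sum(ys) / n
    if degree == 0:
        return [mean_y, 0.0]
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return [mean_y - slope * mean_x, slope]


def mse(coef, xs, ys):
    """Mean squared error of the fitted polynomial on (xs, ys)."""
    return sum((coef[0] + coef[1] * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_degree(train, valid, degrees=(0, 1)):
    """Fit each candidate on the training set and keep the one with the
    lowest error on the validation set. Returns (degree, error, coef)."""
    best = None
    for deg in degrees:
        coef = fit_poly(train[0], train[1], deg)  # fixed parameters
        err = mse(coef, valid[0], valid[1])       # judged on unseen data
        if best is None or err < best[1]:
            best = (deg, err, coef)
    return best
```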
<br />
These matters of model generalization vs. complexity will be discussed in more detail in the lectures to follow.<br />
<br />
===Feed-Forward Neural Network===<br />
The Neural Network (NN) is one application of the universal function approximator. It can be thought of as a system of Perceptrons linked together as units of a network. One particular NN useful for classification is the Feed-Forward Neural Network (FFNN), which consists of multiple "hidden layers" of Perceptron units. Our discussion here is based around the FFNN, which has the topology shown in Figure 1. The first hidden layer of units receives input from the original features. Between the hidden layers, connections from each unit are always directed to units in the next adjacent layer. In the output layer, which receives input only from the last hidden layer, each unit produces a target measurement for a distinct class (i.e. <math>\,K</math> classes require <math>\,K</math> units). In Figure 1, the units in a single layer are distributed vertically, and the inputs and outputs of the network are shown as the far left and right layers respectively.<br />
<br />
[[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]]<br />
<br />
====Mathematical Model of the FFNN with One Hidden Layer====<br />
The FFNN with one hidden layer for a <math>\,K</math>-class problem is defined as follows. Let <math>\,d</math> be the number of input features, <math>\,p</math> be the number of units in the hidden layer, and <math>\,K</math> be the number of classes (i.e. the number of units in the output layer).<br />
<br />
Each neural unit calculates its derived feature (i.e. output) using a linear combination of its inputs. Suppose <math>\,\underline{x}</math> is the <math>\,d</math>-dimensional vector of input features. Then, each neural unit uses a <math>\,d</math>-dimensional vector of weights to combine these input features: for the <math>\,i</math>th neural unit, let <math>\underline{u}_i</math> be this vector of weights. The linear combination calculated by the <math>\,i</math>th unit is then given by<br />
<br />
<math>a_i = \underline{u}_i^T\underline{x}</math><br />
<br />
However, we want the derived feature to lie between 0 and 1, so we apply an ''activation function'' <math>\,\sigma(a)</math>. The derived feature for the <math>\,i</math>th unit is then given by<br />
<br />
<math>\,z_i = \sigma(a_i)</math> where <math>\,\sigma</math> is typically the logistic function<br />
<br />
<math>\sigma(a) = \cfrac{1}{1+e^{-a}}</math><br />
<br />
Now, we place each of the derived features <math>\,z_i</math> from the hidden layer into a <math>\,p</math>-dimensional vector:<br />
<br />
<math>\underline{z} = \left[ \begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_p \end{array}\right]</math><br />
<br />
Like in the hidden layer, each unit in the output layer calculates its derived feature using a linear combination of its inputs. Each neural unit uses a <math>\,p</math>-dimensional vector of weights to combine the input features derived from the hidden layer. Let <math>\,\underline{w}_k</math> be this vector of weights used in the <math>\,k</math>th unit. The linear combination calculated by the <math>\,k</math>th unit is then given by<br />
<br />
<math>\hat{y}_k = \underline{w}_k^T\underline{z}</math><br />
<br />
<math>\,\hat{y}_k</math> is thus the target measurement for the <math>\,k</math>th class. Note that an activation function <math>\,\sigma</math> is not used here.<br />
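<br />
The forward computation just described can be sketched as below. The function names <code>logistic</code> and <code>forward</code>, and the representation of the weights as lists of vectors <code>U</code> (hidden layer) and <code>W</code> (output layer), are assumptions made for illustration only.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, U, W):
    """One-hidden-layer FFNN.

    x: d input features
    U: p weight vectors of length d (one per hidden unit)
    W: K weight vectors of length p (one per output unit)
    Returns (a, z, y_hat).
    """
    a = [sum(u_l * x_l for u_l, x_l in zip(u, x)) for u in U]      # a_i = u_i^T x
    z = [logistic(a_i) for a_i in a]                               # z_i = sigma(a_i)
    y_hat = [sum(w_j * z_j for w_j, z_j in zip(w, z)) for w in W]  # y_hat_k = w_k^T z
    return a, z, y_hat
```

Note that, as in the text, no activation function is applied to the output layer.<br />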
<br />
Notice that in each of the units, two operations take place:<br />
<br />
* a linear combination of the neuron's inputs is calculated using corresponding weights<br />
<br />
* a nonlinear operation on the linear combination is performed. <br />
<br />
These two calculations are shown in Figure 2. <br />
<br />
The nonlinear function <math>\,\sigma(.)</math> is called the activation function. Activation functions, like the logistic function shown earlier, are usually continuous and have a limited range. Another common activation function used in neural networks is <math>\,\tanh(x)</math> (Figure 3).<br />
<br />
[[File:neuron2.png|300px|thumb|right|Fig.2 A general construction for a single neuron]]<br />
[[File:actfcn.png|300px|thumb|right|Fig.3 <math>tanh</math> as activation function]]<br />
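<br />
The two activation functions mentioned above, and the derivatives that gradient-based training needs, can be written as a short sketch. The function names are hypothetical; the derivative formulas <math>\,\sigma'(a)=\sigma(a)(1-\sigma(a))</math> and <math>\,\tanh'(a)=1-\tanh^2(a)</math> are standard identities.<br />

```python
import math


def logistic(a):
    """Logistic function, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def logistic_prime(a):
    """Derivative of the logistic function: sigma(a) * (1 - sigma(a))."""
    s = logistic(a)
    return s * (1.0 - s)


def tanh_prime(a):
    """Derivative of tanh, range (0, 1]: 1 - tanh(a)^2 = 1 / cosh(a)^2."""
    return 1.0 - math.tanh(a) ** 2
```

The two functions are closely related: <math>\,\tanh(a) = 2\sigma(2a) - 1</math>, so <math>\,\tanh</math> is a rescaled logistic function with range <math>\,(-1,1)</math>.<br />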
<br />
The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression, and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, a threshold stage is necessary.<br />
<br />
====Mathematical Model of the FFNN with Multiple Hidden Layers====<br />
In the FFNN model with a single hidden layer, the derived features were represented as elements of the vector <math>\underline{z}</math>, and the original features were represented as elements of the vector <math>\underline{x}</math>. In the FFNN model with more than one hidden layer, <math>\underline{z}</math> is processed by the second hidden layer in the same way that <math>\underline{x}</math> was processed by the first hidden layer. Perceptrons in the second layer each use their own combination of weights to calculate a new set of derived features. These new derived features are processed by the third hidden layer in a similar way, and the cycle repeats for each additional hidden layer. This progression of processing is depicted in Figure 4.<br />
<br />
====Back-Propagation Learning Algorithm====<br />
<br />
[[File:bpl.png|300px|thumb|right|Fig.4 Labels for weights and derived features in the FFNN.]]<br />
<br />
Every linear-combination calculation in the FFNN involves weights that need to be set, and these weights are set using training data and an algorithm called Back-Propagation. This algorithm is similar to the gradient-descent algorithm introduced in the discussion of the Perceptron. The primary difference is that the gradient used in Back-Propagation is calculated in a more complicated way.<br />
<br />
First of all, we want to minimize the error between the estimated and true target measurements for the training data. That is, if <math>\,U</math> is the set of all weights in the FFNN, then we want to determine<br />
<br />
<math>\arg\min_U \left|y - \hat{y}\right|^2</math><br />
<br />
Now, suppose the hidden layers of the FFNN are labelled as in Figure 4. Then, we want to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the hidden layers of the FFNN. For weights <math>\,u_{jl}</math> this means we will need to find<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}}<br />
= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot<br />
\cfrac{\partial a_j}{\partial u_{jl}} = \delta_{j}z_l<br />
</math><br />
<br />
However, the closed-form solution for <math>\,\delta_{j}</math> is unknown, so we develop a recursive definition (<math>\,\delta_{j}</math> in terms of <math>\,\delta_{i}</math>):<br />
<br />
<math><br />
\delta_j = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j} <br />
= \sum_{i=1}^p \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot<br />
\cfrac{\partial a_i}{\partial a_j} <br />
= \sum_{i=1}^p \delta_i\cdot u_{ij} \cdot \sigma'(a_j)<br />
= \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}<br />
</math><br />
<br />
We also need to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the ''output layer'' <math>\,k</math> of the FFNN (this layer is not shown in Figure 4, but it would be the next layer to the right of the rightmost layer shown). For weights <math>\,u_{ki}</math> this means<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}}<br />
= \cfrac{\partial \left|y - \sum_i u_{ki}z_i\right|^2}{\partial u_{ki}}<br />
= -2(y - \sum_i u_{ki}z_i)z_i<br />
= -2(y - \hat{y})z_i<br />
</math><br />
<br />
With similarity to our computation of <math>\,\delta_j</math>, we define<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}</math><br />
<br />
However, <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}}<br />
= -2(y - \hat{y})</math><br />
<br />
Now that we have <math>\,\delta_k</math> and a recursive definition for <math>\,\delta_j</math>, it is clear that the derivatives for all weights can be computed by starting from the output layer and working back through the hidden layers toward the input layer.<br />
<br />
Based on the above derivation, our algorithm for determining weights in the FFNN is as follows<br />
<br />
:1) Choose random initial weights.<br />
<br />
:2) Apply a new datapoint <math>\underline{x}</math> to the FFNN as the input layer, and calculate the values for all units.<br />
<br />
:3) Compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math>.<br />
<br />
:4) Back-propagate layer-by-layer by computing <math>\delta_j = \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}</math> for all units.<br />
<br />
:5) Compute <math>\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} = \delta_{j}z_l</math> for all weights <math>\,u_{jl}</math>.<br />
<br />
:6) Update <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}}<br />
- \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} </math> for all weights <math>\,u_{jl}</math> where <math>\,\rho</math> is the learning rate.<br />
<br />
:7) If the termination criterion has not been met, go back to step 2 and apply another datapoint (one full pass through all of the training datapoints is called an "epoch").<br />
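<br />
Steps 2 to 6 can be sketched for a network with one hidden layer as follows. This is a minimal illustration, not the course's reference implementation: it assumes the logistic activation in the hidden layer (so <math>\,\sigma'(a_j) = z_j(1-z_j)</math>), no activation in the output layer, and the hypothetical names <code>forward</code>, <code>backprop_step</code> and <code>squared_error</code>, with hidden weights <code>U</code> and output weights <code>W</code> stored as lists of rows.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, U, W):
    """Return hidden outputs z and network outputs y_hat."""
    a = [sum(u_jl * x_l for u_jl, x_l in zip(u_j, x)) for u_j in U]
    z = [logistic(a_j) for a_j in a]
    y_hat = [sum(w_kj * z_j for w_kj, z_j in zip(w_k, z)) for w_k in W]
    return z, y_hat


def squared_error(x, y, U, W):
    _, y_hat = forward(x, U, W)
    return sum((y_k - yh_k) ** 2 for y_k, yh_k in zip(y, y_hat))


def backprop_step(x, y, U, W, rho):
    """Steps 2-6 for a single datapoint; returns the updated (U, W)."""
    z, y_hat = forward(x, U, W)                                      # step 2
    delta_out = [-2.0 * (y_k - yh_k) for y_k, yh_k in zip(y, y_hat)]  # step 3
    # step 4: delta_j = sigma'(a_j) * sum_k delta_k * w_kj
    delta_hid = [z_j * (1.0 - z_j) * sum(d_k * W[k][j] for k, d_k in enumerate(delta_out))
                 for j, z_j in enumerate(z)]
    # steps 5-6: derivative = delta of the unit times the input on that weight
    W_new = [[W[k][j] - rho * delta_out[k] * z[j] for j in range(len(z))]
             for k in range(len(W))]
    U_new = [[U[j][l] - rho * delta_hid[j] * x[l] for l in range(len(x))]
             for j in range(len(U))]
    return U_new, W_new
```

A useful sanity check for any such implementation is to compare the analytic derivative with a finite-difference estimate of the squared error for a single weight.<br />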
<br />
====Alternative Description of the Back-Propagation Algorithm====<br />
Label the inputs and outputs of the <math>\,i</math>th hidden layer <math>\underline{x}_i</math> and <math>\underline{y}_i</math> respectively, and let <math>\,\sigma(.)</math> be the activation function for all neurons. We now have<br />
<br />
<math>\begin{align}<br />
\begin{cases}<br />
\underline{y}_1=\sigma(W_1.\underline{x}_1),\\<br />
\underline{y}_2=\sigma(W_2.\underline{x}_2),\\<br />
\underline{y}_3=\sigma(W_3.\underline{x}_3),<br />
\end{cases}<br />
\end{align}</math><br />
<br />
where <math>\,W_i</math> is the matrix of connection weights between layers <math>\,i</math> and <math>\,i+1</math>, and has <math>\,n_i</math> columns and <math>\,n_{i+1}</math> rows, where <math>\,n_i</math> is the number of neurons in the <math>\,i^{th}</math> layer.<br />
<br />
Considering these matrix equations, one can write a closed form for the derivative of the error with respect to the weights of the network. For a neural network with two hidden layers, the equations are as follows.<br />
<br />
<math>\begin{align}<br />
\frac{\partial E}{\partial W_3}=&diag(e).\sigma'(W_3.\underline{x}_3).(\underline{x}_3)^T,\\<br />
\frac{\partial E}{\partial W_2}=&\sigma'(W_2.\underline{x}_2).(\underline{x}_2)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3\}\},\\<br />
\frac{\partial E}{\partial W_1}=&\sigma'(W_1.\underline{x}_1).(\underline{x}_1)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3.diag(\sigma'(W_2.\underline{x}_2)).W_2\}\},<br />
\end{align}</math><br />
<br />
where <math>\,\sigma'(.)</math> is the derivative of the activation function <math>\,\sigma(.)</math>.<br />
<br />
Using this closed-form derivative, it is possible to code the procedure for any number of layers and neurons. Here is Matlab code for the backpropagation algorithm. (<math>\,tanh</math> is utilized as the activation function.)<br />
<br />
<br />
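 % The following variables are assumed to be initialized before this loop (not shown here):<br />
 % i = 0, ep (number of epochs), Q (number of samples), f (number of features), dd (rows of data),<br />
 % n (units per layer), l (number of weight layers), w (weights), ro (learning rate), E (per-epoch error).<br />
 % shuffle is not a built-in function; it is assumed to permute the samples (columns) of data.<br />
 while i < ep<br />
     i = i + 1;<br />
     data = shuffle(data,2);<br />
     for j = 1:Q<br />
         io = zeros(max(n)+1,length(n));<br />
         gp = io;<br />
         % input layer: bias term 1 followed by the features of sample j<br />
         io(1:n(1)+1,1) = [1;data(1:f,j)];<br />
         % forward pass through the layers<br />
         for k = 1:l<br />
             io(2:n(k+1)+1,k+1) = w(2:n(k+1)+1,1:n(k)+1,k)*io(1:n(k)+1,k);<br />
             % derivative of tanh is 1/cosh^2<br />
             gp(1:n(k+1)+1,k) = [0;1./(cosh(io(2:n(k+1)+1,k+1))).^2];<br />
             io(1:n(k+1)+1,k+1) = [1;tanh(io(2:n(k+1)+1,k+1))];<br />
             wg(1:n(k+1)+1,1:n(k)+1,k) = diag(gp(1:n(k+1)+1,k))*w(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
         % output error (the leading 0 corresponds to the bias row)<br />
         e = [0;io(2:n(l+1)+1,l+1) - data(f+1:dd,j)];<br />
         wg(1:n(l+1)+1,1:n(l)+1,l) = diag(e)*wg(1:n(l+1)+1,1:n(l)+1,l);<br />
         gp(1:n(l+1)+1,l) = diag(e)*gp(1:n(l+1)+1,l);<br />
         d = eye(n(l+1)+1);<br />
         E(i) = E(i) + 0.5*norm(e)^2;<br />
         % backward pass: update the weights from the output layer toward the input layer<br />
         for k = l:-1:1<br />
             w(1:n(k+1)+1,1:n(k)+1,k) = w(1:n(k+1)+1,1:n(k)+1,k) - ro*diag(sum(d,1))*gp(1:n(k+1)+1,k)*(io(1:n(k)+1,k)');<br />
             d = d*wg(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
     end<br />
 end<br />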
<br />
====Some notes on the neural network and its learning algorithm====<br />
<br />
* The activation functions are usually approximately linear around the origin. If this is the case, choosing random weights between <math>\,-0.5</math> and <math>\,0.5</math> and normalizing the data may speed up the algorithm in the very first steps of the procedure, as the linear combination of the inputs and weights falls within the linear region of the activation function.<br />
<br />
* Learning of the neural network using the backpropagation algorithm takes place in epochs. An epoch is a single pass through the entire training set.<br />
<br />
* It is a common practice to randomly change the permutation of the training data in each one of the epochs, to make the learning independent of the data permutation.<br />
<br />
* Given a set of data for training a neural network, one should set aside a portion of it as the validation dataset, to determine a suitable number of layers and number of neurons in each of the layers. The best construction may be the one which leads to the least error for the validation dataset. Validation data must not be used for training the network.<br />
<br />
* We can also use the validation-training scheme to estimate how many epochs are enough for training the network.<br />
<br />
* It is also common to use other optimization algorithms such as steepest descent and conjugate gradient in a batch form.<br />
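<br />
The notes on epochs, shuffling and validation can be combined into a generic training loop, sketched below. The function <code>train_with_validation</code> and its arguments are hypothetical: <code>step(params, example)</code> stands for one update (for instance one back-propagation step) and <code>error(params, data)</code> for the error on a dataset, both supplied by the user.<br />

```python
import random


def train_with_validation(train, valid, step, error, params, max_epochs=100, seed=0):
    """Run epochs over shuffled training data and keep the parameters that
    give the lowest validation error.

    Returns (best_params, best_error, best_epoch).
    """
    rng = random.Random(seed)
    best_params, best_err, best_epoch = params, error(params, valid), 0
    order = list(train)
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(order)          # new permutation of the data in each epoch
        for example in order:       # one epoch = one pass over the training set
            params = step(params, example)
        err = error(params, valid)  # validation data is never passed to step()
        if err < best_err:
            best_params, best_err, best_epoch = params, err, epoch
    return best_params, best_err, best_epoch
```

The returned <code>best_epoch</code> gives a rough estimate of how many epochs were useful for the given data split.<br />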
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem, where it is difficult to estimate the errors. So in practice configuring a Neural Network with Back-propagation faces some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. The Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
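<br />
The shrinking of <math>\,\delta</math> can be illustrated with a chain of single-unit layers, where the recursion reduces to <math>\,\delta_j = \sigma'(a_j)\,u\,\delta_i</math>. The sketch below assumes the logistic activation, whose derivative is at most 0.25; the function name <code>delta_through_chain</code> and the chosen weights are hypothetical.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def delta_through_chain(delta_out, weights, activations):
    """Propagate delta back along a chain of single units.

    weights[i] and activations[i] belong to the i-th layer counted
    back from the output. Returns the list of deltas, output first.
    """
    deltas = [delta_out]
    for u, a in zip(weights, activations):
        s = logistic(a)
        deltas.append(s * (1.0 - s) * u * deltas[-1])  # sigma'(a) * u * delta
    return deltas


# Ten layers with unit weights and a = 0, where sigma'(a) takes its
# largest possible value 0.25: delta shrinks by a factor of 4 per layer.
deltas = delta_through_chain(1.0, [1.0] * 10, [0.0] * 10)
```

Even in this most favourable case for the logistic function, the last entry of <code>deltas</code> is <math>\,0.25^{10}</math>, i.e. below <math>\,10^{-6}</math>, unless the weights are large enough to compensate.<br />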
<br />
The approach to training the deep network is to first assume that the network has only two layers and train these two layers. After that we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize the energy function, which is inspired by the theory in atomic physics concerning the most stable condition; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.<br />
<br />
==== Difficulties of training deep architecture <ref>{{Cite journal | title = Exploring Strategies for Training Deep Neural Networks | url = http://jmlr.csail.mit.edu/papers/volume10/larochelle09a/larochelle09a.pdf | year = 2009 | journal = Journal of Machine Learning Research | page = 1-40 | volume = 10 | last1 = Larochelle | first1 = H. | last2 = Bengio | first2 = Y. | last3 = Louradour | first3 = J. | last4 = Lamblin | first4 = P. }}</ref> ====<br />
<br />
Given a particular task, a natural way to train a deep network is to frame it as an optimization<br />
problem by specifying a supervised cost function on the output layer with respect to the desired<br />
target and use a gradient-based optimization algorithm in order to adjust the weights and biases<br />
of the network so that its output has low cost on samples in the training set. Unfortunately, deep<br />
networks trained in that manner have generally been found to perform worse than neural networks<br />
with one or two hidden layers.<br />
<br />
We discuss two hypotheses that may explain this difficulty. The first one is that gradient descent<br />
can easily get stuck in poor local minima (Auer et al., 1996) or plateaus of the non-convex training<br />
criterion. The number and quality of these local minima and plateaus (Fukumizu and Amari, 2000)<br />
clearly also influence the chances for random initialization to be in the basin of attraction (via<br />
gradient descent) of a poor solution. It may be that with more layers, the number or the width<br />
of such poor basins increases. To reduce the difficulty, it has been suggested to train a neural<br />
network in a constructive manner in order to divide the hard optimization problem into several<br />
greedy but simpler ones, either by adding one neuron (e.g., see Fahlman and Lebiere, 1990) or one<br />
layer (e.g., see Lengellé and Denoeux, 1996) at a time. These two approaches have demonstrated to<br />
be very effective for learning particularly complex functions, such as a very non-linear classification<br />
problem in 2 dimensions. However, these are exceptionally hard problems, and for learning tasks<br />
usually found in practice, this approach commonly overfits.<br />
<br />
This observation leads to a second hypothesis. For high capacity and highly flexible deep networks,<br />
there actually exists many basins of attraction in its parameter space (i.e., yielding different<br />
solutions with gradient descent) that can give low training error but that can have very different generalization<br />
errors. So even when gradient descent is able to find a (possibly local) good minimum<br />
in terms of training error, there are no guarantees that the associated parameter configuration will<br />
provide good generalization. Of course, model selection (e.g., by cross-validation) will partly correct<br />
this issue, but if the number of good generalization configurations is very small in comparison<br />
to good training configurations, as seems to be the case in practice, then it is likely that the training<br />
procedure will not find any of them. But, as we will see in this paper, it appears that the type of<br />
unsupervised initialization discussed here can help to select basins of attraction (for the supervised<br />
fine-tuning optimization phase) from which learning good solutions is easier both from the point of<br />
view of the training set and of a test set.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced they were thought to model human brains, hence they were given the fancy name "Neural Network". But now we know that they are just logistic regression layers on top of each other and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of these kinds of claims it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since the cost function is not convex, we still face the problem of local minima, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and adjust them wisely, and they are not currently an active research area in machine learning. Having said that, NNs still have wide applications in engineering fields such as control.<br />
<br />
===Business Applications of Neural Networks===<br />
<br />
Neural networks are increasingly being used in real-world business applications and, in some cases, such as fraud detection, they have already become the method of choice. Their use for risk assessment is also growing and they have been employed to visualize complex databases for marketing segmentation. This method covers a wide range of business interests — from finance management, through forecasting, to production. The combination of statistical, neural and fuzzy methods now enables direct quantitative studies to be carried out without the need for rocket-science expertise.<br />
<br />
* On the Use of Neural Networks for Analysis Travel Preference Data <br />
* Extracting Rules Concerning Market Segmentation from Artificial Neural Networks <br />
* Characterization and Segmenting the Business-to-Consumer E-Commerce Market Using Neural Networks<br />
* A Neurofuzzy Model for Predicting Business Bankruptcy <br />
* Neural Networks for Analysis of Financial Statements <br />
* Developments in Accurate Consumer Risk Assessment Technology <br />
* Strategies for Exploiting Neural Networks in Retail Finance <br />
* Novel Techniques for Profiling and Fraud Detection in Mobile Telecommunications<br />
* Detecting Payment Card Fraud with Neural Networks<br />
* Money Laundering Detection with a Neural-Network <br />
* Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=7364stat841f102010-10-25T20:58:59Z<p>Hclam: /* Multi-Class Logistic Regression & Perceptron - October 19, 2010 */</p>
<hr />
<div>==[[Proposal Fall 2010]] ==<br />
==[[statf10841Scribe|Editor sign up]] ==<br />
{{Cleanup|date=October 8 2010|reason=Provide a summary for each topic here.}}<br />
== Summary ==<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
=== Principal Component Analysis ===<br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)] , the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA works by using a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data to project the data onto these directions so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation of our data is usually much easier to visualize and it also exhibits the most informative aspects (dimensions) of our data whilst capturing as much of the variation exhibited by our data as it possibly could.<br />
<br />
==[[f10_Stat841_digest |Digest ]] ==<br />
<br />
== ''' Reference Textbook''' ==<br />
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
== ''' Classification - September 21, 2010''' ==<br />
<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginning of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably carried out by prehistoric peoples to recognize which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction] (feature extraction or manifold learning). Note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.<br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found [http://www.e-knowledge.ca/quotes.php?topic=Knowledge here].<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers <br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the <math>\,i</math>th training input's feature values <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math> form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which can take a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
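<br />
The definition of the empirical error rate above can be sketched in a few lines of Python. This is only an illustration: the function name <code>empirical_error_rate</code>, the toy rule <code>threshold_rule</code>, and the data values are assumptions made up for this sketch and do not come from the lecture.<br />

```python
def empirical_error_rate(h, inputs, labels):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(inputs) != len(labels) or not inputs:
        raise ValueError("inputs and labels must be non-empty and equally long")
    # Each term of the sum is the indicator I(h(X_i) != Y_i).
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(inputs)


# Hypothetical toy rule: label a fruit 1 when its diameter exceeds 7.0, else 0.
def threshold_rule(diameter):
    return 1 if diameter > 7.0 else 0


# Made-up training data (diameters and true labels).
diameters = [6.1, 6.8, 7.4, 8.0, 7.2]
true_labels = [0, 0, 1, 1, 0]
error = empirical_error_rate(threshold_rule, diameters, true_labels)
```

Here the rule misclassifies only the last training input, so one of the five indicator terms equals 1.<br />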
<br />
<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify any new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
<br />
In practice, the empirical error rate is used to estimate the true error rate, whose value cannot be known exactly because the parameter values of the underlying process cannot be known but can only be estimated using available data. As mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], the empirical error rate is an unbiased estimator of the true error rate when the rule <math>\,h</math> is fixed independently of the data on which the error is measured. When <math>\,h</math> is trained on the same data, the training error rate tends to underestimate the true error rate.<br />
<br />
=== Bayes Classifier ===<br />
<br />
{{Cleanup|date=October 14 2010|reason=In response to the previous tag: The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function <math>\,f_y(x)</math> in the equation:<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
The simpler form of the likelihood function seen in the naive Bayes is:<br />
:<math><br />
\begin{align}<br />
f_y(x) = P(X=x|Y=y) = {\prod_{i=1}^{n} P(X_{i}=x_{i}|Y=y)}<br />
\end{align}<br />
</math><br />
The Bayes classifier taught in class was not the naive Bayes classifier. Perhaps a comment should be made about the naive Bayes classifier in the body of the text}}<br />
<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics). Its special case, the [http://en.wikipedia.org/wiki/Naive_Bayes_classifier naive Bayes classifier], additionally makes strong (naive) independence assumptions; a more descriptive term for its underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that, within a class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests] [2].<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
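<br />
Under the naive independence assumption described above, the class-conditional likelihood factorizes into a product of per-feature terms. The following minimal sketch illustrates that product for discrete features; the function name <code>naive_likelihood</code> and all probability values are hypothetical and chosen only for illustration.<br />

```python
import math


def naive_likelihood(x, feature_probs):
    """Product over features j of P(X_j = x_j | Y = y) under the naive assumption."""
    return math.prod(feature_probs[j][value] for j, value in enumerate(x))


# Hypothetical per-feature conditional probabilities for the class "apple":
# feature 0 is the colour, feature 1 is the shape.
apple_probs = [
    {"red": 0.7, "green": 0.3},
    {"round": 0.9, "oblong": 0.1},
]
likelihood = naive_likelihood(("red", "round"), apple_probs)
```

Because each feature is modelled separately, only one small table (or, for continuous features, one mean and one variance) per feature and class has to be estimated.<br />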
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class <math> y \in \mathcal{Y} </math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h^*)</math> essentially combines the trained model and the decision function <math>\,h^*</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
The Bayes classifier is the optimal classifier in that it attains the least possible true probability of misclassifying a new data input, i.e., for any generic classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule.<br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. Estimating all of these probability functions (the likelihood, the prior probability, and the evidence) is also computationally very expensive, which makes some other classifiers more favorable than the Bayes classifier.<br />
<br />
A detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h^*</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h^*)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
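<br />
Approach 3 (density estimation) can be sketched as follows for a single real-valued feature. The sketch assumes, purely for illustration, that each class-conditional density is estimated by fitting a one-dimensional Gaussian; the lecture does not prescribe this particular density estimator, and the function names and data below are made up.<br />

```python
import math


def fit_gaussian(values):
    """Return the sample mean and maximum-likelihood variance of a 1-D sample."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_plugin_classifier(xs, ys):
    """Estimate f_0, f_1 and P(Y=1), then return (r_hat, h) as functions of x."""
    params0 = fit_gaussian([x for x, y in zip(xs, ys) if y == 0])
    params1 = fit_gaussian([x for x, y in zip(xs, ys) if y == 1])
    prior1 = sum(ys) / len(ys)  # (1/n) * sum of the Y_i

    def r_hat(x):
        # Estimated posterior P(Y=1 | X=x) from the Bayes formula.
        num = gaussian_pdf(x, *params1) * prior1
        den = num + gaussian_pdf(x, *params0) * (1.0 - prior1)
        return num / den

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h


# Made-up training data: class 0 clusters near 1, class 1 clusters near 4.
xs = [0.8, 1.1, 1.3, 0.9, 3.7, 4.2, 4.0, 4.4]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
r_hat, h = fit_plugin_classifier(xs, ys)
```

Calling <code>h(1.0)</code> or <code>h(4.1)</code> then applies the rule to a new input. Any other density estimator could be substituted for <code>fit_gaussian</code> without changing the rest of the sketch.<br />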
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
'''Theorem'''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course. ''Note: these are the known y values in the training data.'' <br />
<br />
These known data are summarized in the following tables:<br />
<br />
:[[File:裁剪.jpg]]<br />
{{Cleanup|date=September 2010|reason=this graph is not complete, the reason is that it should be in consistent with the computation below.}}<br />
<br />
For each student, his or her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.05 \times 0.5}{0.05 \times 0.5+0.2 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}<\frac{1}{2}.</math><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
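<br />
The calculation in this example can be reproduced with a short sketch of the Bayes formula and the arg max rule. The function names <code>posteriors</code> and <code>bayes_rule</code> are assumptions for this illustration; the likelihood values 0.05 and 0.2 and the equal priors of 0.5 are the ones quoted in the example.<br />

```python
def posteriors(likelihoods, priors):
    """Apply the Bayes formula: posterior_y = f_y(x) * pi_y / sum_i f_i(x) * pi_i."""
    joint = {label: likelihoods[label] * priors[label] for label in priors}
    evidence = sum(joint.values())
    return {label: value / evidence for label, value in joint.items()}


def bayes_rule(likelihoods, priors):
    """Return the label with the largest posterior probability (arg max rule)."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Values quoted in the example: f_1(x) = 0.05, f_0(x) = 0.2, equal priors of 0.5.
likelihoods = {1: 0.05, 0: 0.2}
priors = {1: 0.5, 0: 0.5}
post = posteriors(likelihoods, priors)
predicted = bayes_rule(likelihoods, priors)
```

Because the dictionaries may hold any number of labels, the same two functions also cover the multi-class case stated in the theorem above.<br />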
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
<br />
The Bayesian view of probability states that any event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how strongly one believes that event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
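<br />
The long-run relative frequency described above can be illustrated with a small simulation. This is only a sketch: the intrinsic probability 0.3, the trial counts, and the function name <code>relative_frequency</code> are hypothetical choices, and a pseudo-random generator stands in for real independent trials.<br />

```python
import random


def relative_frequency(p_true, n_trials, seed=0):
    """Simulate n_trials independent trials and return n_x / n_t."""
    rng = random.Random(seed)
    # n_x counts the trials in which the event occurs.
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_x / n_trials


# Hypothetical event with intrinsic probability 0.3, observed over more and more trials.
estimates = {n: relative_frequency(0.3, n) for n in (100, 10_000, 100_000)}
```

As the number of trials grows, the entries of <code>estimates</code> are expected to settle closer to 0.3, although any finite run only yields an approximation.<br />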
<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a boundary surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] under the assumptions of LDA) that determines the class of any new data input depending on which side of the boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by assuming that both classes have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distributions] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> to <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of <math>\,D(h^*)</math> is as follows:<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out alike terms and factoring).<br />
<br />
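It is easy to see that, under LDA, the Bayes classifier's decision boundary <math>\,D(h^*)</math> has the form <math>\,a^\top x+b=0</math>, with <math>\,a = \Sigma^{-1}(\mu_1-\mu_0)</math> and <math>\,b = \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 \right)</math>, and it is linear in <math>\,x</math>. This is where the word ''linear'' in linear discriminant analysis comes from.<br />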
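<br />
Reading the coefficients off the last line of the derivation above, the boundary can be written as a linear equation in <math>\,x</math> whose coefficient vector is <math>\,\Sigma^{-1}(\mu_1-\mu_0)</math>. The following pure-Python sketch computes these coefficients for the two-dimensional case. The helper names and all parameter values are assumptions made for illustration, and in practice the means, covariance matrix, and priors would have to be estimated from training data.<br />

```python
import math


def inverse_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix must be invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def lda_boundary(mu0, mu1, sigma, pi0, pi1):
    """Return (a, b) such that the LDA decision boundary is a.x + b = 0."""
    sigma_inv = inverse_2x2(sigma)
    a = mat_vec(sigma_inv, [mu1[0] - mu0[0], mu1[1] - mu0[1]])
    b = math.log(pi1 / pi0) - 0.5 * (
        dot(mu1, mat_vec(sigma_inv, mu1)) - dot(mu0, mat_vec(sigma_inv, mu0))
    )
    return a, b


def lda_classify(x, a, b):
    """Assign class 1 when a.x + b > 0, i.e. when P(Y=1|x) > P(Y=0|x)."""
    return 1 if dot(a, x) + b > 0 else 0


# Made-up parameters for illustration only.
mu0, mu1 = [0.0, 0.0], [2.0, 1.0]
sigma = [[1.0, 0.3], [0.3, 2.0]]
a, b = lda_boundary(mu0, mu1, sigma, pi0=0.5, pi1=0.5)
midpoint = [(mu0[0] + mu1[0]) / 2, (mu0[1] + mu1[1]) / 2]
```

With equal priors, evaluating <code>dot(a, midpoint) + b</code> gives zero up to rounding, so the midpoint of the two class means lies on the boundary.<br />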
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. Then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with this derivation, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0</math>. In addition, for any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> both have the same number of data points, so that their estimated prior probabilities are equal, then, in this special case, the resulting decision boundary would pass exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions under LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
According to [http://www.lsv.uni-saarland.de/Vorlesung/Digital_Signal_Processing/Summer06/dsp06_chap9.pdf this link], some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that the data in each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
<br />
The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-classes case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows: <br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math> (by cancellation)<br />
:<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)</math> (by taking the log of both sides)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0</math> (by expanding out)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0</math> <br />
<br />
It is easy to see that, under QDA, the decision boundary <math>\,D(h^*)</math> has the form <math>\,x^\top ax+b^\top x+c=0</math> (which reduces to <math>\,ax^2+bx+c=0</math> when <math>\,x</math> is a scalar), so it is quadratic in <math>\,x</math>. This is where the word ''quadratic'' in quadratic discriminant analysis comes from.<br />
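<br />
Grouping the terms in the last line of the derivation by their degree in <math>\,x</math> and writing the boundary as <math>\,x^\top ax+b^\top x+c=0</math> gives the coefficients below (a bookkeeping sketch; <math>\,a</math>, <math>\,b</math> and <math>\,c</math> are only labels for the grouped terms):<br />
<br />
:<math>\,a=-\frac{1}{2}(\Sigma_1^{-1}-\Sigma_0^{-1}),\quad b=\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0,\quad c=\log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left(\mu_1^\top\Sigma_1^{-1}\mu_1-\mu_0^\top\Sigma_0^{-1}\mu_0\right)</math><br />
<br />
When <math>\,\Sigma_1=\Sigma_0</math>, we get <math>\,a=0</math> and the log-determinant term vanishes, so the boundary reduces to the linear LDA boundary.<br />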
<br />
As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0</math>.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
* In the case of LDA, which assumes that a common covariance matrix is shared by all classes, <math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is linear in <math>\,x</math>.<br />
<br />
* In the case of QDA, which assumes that each class has its own covariance matrix, <math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is quadratic in <math>\,x</math>.<br />
<br />
<br />
'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of k for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]<br />
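<br />
The two discriminant functions in the theorem can be evaluated directly. The following is a minimal Matlab sketch, not code from the course: the means, covariances, priors and the test point are invented for illustration, and the variable names (<code>dLDA</code>, <code>dQDA</code>, <code>SigmaCommon</code>) are hypothetical.<br />

```matlab
% Hypothetical two-class example in d = 2 (all numbers are made up).
mu    = {[0; 0], [2; 0]};        % class means mu_1, mu_2
Sigma = {eye(2), [4 0; 0 1]};    % per-class covariances (used by QDA)
SigmaCommon = eye(2);            % shared covariance (used by LDA)
prior = [0.5, 0.5];              % class priors pi_1, pi_2
x = [0.5; 0];                    % point to classify

dLDA = zeros(1, 2);
dQDA = zeros(1, 2);
for k = 1:2
    % LDA: x' Sigma^{-1} mu_k - 1/2 mu_k' Sigma^{-1} mu_k + log(pi_k)
    dLDA(k) = x' * (SigmaCommon \ mu{k}) ...
              - 0.5 * mu{k}' * (SigmaCommon \ mu{k}) + log(prior(k));
    % QDA: -1/2 log|Sigma_k| - 1/2 (x-mu_k)' Sigma_k^{-1} (x-mu_k) + log(pi_k)
    r = x - mu{k};
    dQDA(k) = -0.5 * log(det(Sigma{k})) - 0.5 * r' * (Sigma{k} \ r) + log(prior(k));
end

% h*(x) = argmax_k delta_k(x)
[~, classLDA] = max(dLDA);
[~, classQDA] = max(dQDA);
```

The backslash operator solves the linear system instead of forming <math>\,\Sigma^{-1}</math> explicitly, which is the usual numerical choice in Matlab.<br />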
<br />
===In practice===<br />
In practice, the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their Maximum Likelihood estimates from the sample in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
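The common covariance matrix, denoted <math>\Sigma</math>, is estimated by a weighted average of the covariance estimates of the individual classes. <br />
<br />
In the case where we need a common covariance matrix, we get the estimate using the following equation:<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{n-k} </math><br />
<br />
where <math>\,n_r</math> is the number of data points in class r, <math>\,\Sigma_r</math> is the covariance estimate of class r, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes. When <math>\,\Sigma_r</math> is the Maximum Likelihood estimate given above, <math>\,n_r\Sigma_r</math> is the scatter matrix of class r, so dividing by <math>\,n-k</math> gives the unbiased pooled estimate, while dividing by <math>\,n</math> instead gives the Maximum Likelihood estimate.<br />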
<br />
See the details about the [http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices estimation of covariance matrices].<br />
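<br />
The estimates above can be computed in a few lines. The following Matlab sketch uses a small invented data set (rows are observations) and hypothetical variable names; it is meant only to illustrate the formulas, including the pooled covariance with divisor <math>\,n-k</math>.<br />

```matlab
% Toy data: rows are observations, y holds the class labels (invented values).
X = [0 0; 2 0; 0 2; 2 2;  5 5; 7 5; 5 7; 7 7; 6 6];
y = [1; 1; 1; 1; 2; 2; 2; 2; 2];

classes = unique(y);
K = numel(classes);
[n, d] = size(X);

piHat      = zeros(K, 1);      % estimated priors n_k / n
muHat      = zeros(K, d);      % estimated class means (one row per class)
SigmaHat   = zeros(d, d, K);   % ML covariance estimate of each class
scatterSum = zeros(d, d);      % running sum of n_k * SigmaHat_k

for k = 1:K
    Xk = X(y == classes(k), :);
    nk = size(Xk, 1);
    piHat(k)    = nk / n;
    muHat(k, :) = mean(Xk, 1);
    Xc = Xk - repmat(muHat(k, :), nk, 1);    % center the class data
    SigmaHat(:, :, k) = (Xc' * Xc) / nk;     % divisor n_k (ML estimate)
    scatterSum = scatterSum + nk * SigmaHat(:, :, k);
end

SigmaPooled = scatterSum / (n - K);          % common covariance for LDA
```
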
<br />
===Computation For QDA And LDA===<br />
<br />
First, let us consider QDA, and examine each of the following two cases.<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
<math>\, \Sigma_k = I </math> for every class <math>\,k</math> implies that our data is spherical. This means that the data of each class <math>\,k</math> is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{-1}{2}log(|I|)</math>, is zero since <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center, halve it, and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that minimizes <math>\,\frac{1}{2}(x-\mu_k)^\top(x-\mu_k) - log(\pi_k)</math> maximizes <math>\,\delta_k</math>; when all priors are equal, this is simply the class with the nearest center. According to the theorem, we then assign the point to that class <math>\,k</math>. <br />
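<br />
The following Matlab sketch illustrates Case 1 with invented numbers (two centers, strongly unequal priors, hypothetical variable names). It shows that the point is not always assigned to the nearest center, because the log of the prior shifts <math>\,\delta_k</math>.<br />

```matlab
% Case 1 (Sigma_k = I): squared Euclidean distance adjusted by the log prior.
mu    = [0 0; 4 0];      % rows are the centers mu_1 and mu_2 (invented)
prior = [0.9, 0.1];      % strongly unequal priors (invented)
x     = [2.2, 0];        % point to classify, slightly closer to mu_2

d2    = sum((mu - repmat(x, 2, 1)).^2, 2)';   % squared distances to each center
delta = -0.5 * d2 + log(prior);               % delta_k with Sigma_k = I

[~, nearest] = min(d2);      % index of the nearest center
[~, chosen]  = max(delta);   % class chosen by the Bayes rule
```

Here <code>nearest</code> is 2 but <code>chosen</code> is 1, since the large prior of class 1 outweighs the small difference in distance.<br />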
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top </math> (In general, when a matrix <math>\,A</math> has the singular value decomposition <math>\,A=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,AA^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,A^\top A</math>. So if <math>\,A</math> is symmetric and positive semi-definite, we can take <math>\,U=V</math>. Here <math>\, \Sigma_k </math> is symmetric and positive semi-definite because it is the covariance matrix of the data <math> X_k </math> of class <math>\,k</math>) and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top </math> (since <math>\,U_k</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S_k^{-\frac{1}{2}}U_k^\top x </math> and <math>\, S_k^{-\frac{1}{2}}U_k^\top\mu_k</math>.<br />
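<br />
The identity derived above can be checked numerically. The Matlab sketch below uses an invented covariance matrix, center and point (the names <code>W</code>, <code>dMahal</code> and <code>dEuclid</code> are hypothetical) and compares the quadratic form <math>\,(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)</math> with the squared Euclidean distance after applying <math>\,S_k^{-\frac{1}{2}}U_k^\top</math>.<br />

```matlab
% Invented symmetric positive definite covariance, center and data point.
Sigma = [4 1; 1 3];
mu    = [1; 2];
x     = [3; 1];

[U, S, ~] = svd(Sigma);                 % Sigma = U*S*U' for a covariance matrix
W = diag(1 ./ sqrt(diag(S))) * U';      % the transformation S^{-1/2} U'

xStar  = W * x;                         % transformed data point
muStar = W * mu;                        % transformed center

dMahal  = (x - mu)' * (Sigma \ (x - mu));   % (x-mu)' Sigma^{-1} (x-mu)
dEuclid = sum((xStar - muStar).^2);         % squared Euclidean distance after transform
```

Both quantities equal <math>\,20/11</math> for these numbers, up to floating-point rounding.<br />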
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>.<br />
<br />
A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math> where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math> and <math>\,\mu_k^*</math>, treating them as in Case 1 above.<br />
<br />
{{Cleanup|date=October 18 2010|reason=The sentence above may cause some misleading. In general case, <math>\,\Sigma_k </math> may not be the same . So you can't treat them completely the same as in Case 1 above. You need to compute <math>\, log{|\Sigma_k |} </math> differently. Here is a detailed discussion below:}}<br />
{{Cleanup|date=October 18 2010|reason=The sentence above is right since by transforming<math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>, the new variable variance is <math>I</math>}}<br />
<br />
<br />
Note that in QDA the classes generally have different covariance matrices <math>\,\Sigma_k</math>, so each class has its own transformation <math>\,S_k^{-\frac{1}{2}}U_k^\top</math> and its own value of <math>\, log{|\Sigma_k|}</math>, which must be included when computing <math> \,\delta_k </math> for QDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we still use this method to transform a data point <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
A single transformation cannot be used for all classes: if we used, say, only the transformation of class A, we would be assuming ahead of time that the data point belongs to class A. However, the method still works if we apply the transformation of every class in turn. Consider two classes with different shapes and a data point to be classified. We perform the transformations corresponding to the two classes respectively to obtain <math>\,\delta_1 ,\delta_2 </math>, and then determine which class the data point belongs to by comparing <math> \,\delta_1 </math> and <math> \,\delta_2 </math>. This is the approach that was used for the QDA computation in assignment 1, where the three covariance matrices were different.<br />
<br />
Only when all classes have the same shape, i.e. share the same covariance matrix as in LDA, does one common transformation suffice for all classes.<br />
<br />
In summary, to apply QDA on a data set <math>\,X</math>, in the general case where <math>\, \Sigma_k \ne I </math> for each class <math>\,k</math>, one can proceed as follows:<br />
<br />
:: Step 1: For each class <math>\,k</math>, estimate the covariance matrix <math>\,\Sigma_k</math> from the data <math>\,X_k</math> of that class and apply singular value decomposition to it to obtain <math>\,S_k</math> and <math>\,U_k</math>.<br />
<br />
:: Step 2: For each data point <math>\,x</math> and each class <math>\,k</math>, transform <math>\,x</math> to <math>\,x^*_k = S_k^{-\frac{1}{2}}U_k^\top x</math>, and transform the center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math>, find for every class <math>\,k</math> the squared Euclidean distance between the transformed data point <math>\,x^*_k</math> and the transformed center <math>\,\mu_k^*</math>, compute <math>\,\delta_k(x) = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}\|x^*_k-\mu_k^*\|^2 + \log(\pi_k)</math>, and assign <math>\,x</math> to the class for which <math>\,\delta_k(x)</math> is the largest over all of the classes.<br />
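<br />
The steps above can be sketched in Matlab as follows. All numbers are invented and the variable names are hypothetical. Besides the transformed distances, the sketch evaluates <math>\,\delta_k</math> from the theorem, including the <math>\,\log|\Sigma_k|</math> and prior terms, because with unequal covariance matrices the smallest transformed distance alone need not give the largest <math>\,\delta_k</math>.<br />

```matlab
% Two invented classes: class 2 is much more spread out than class 1.
mu    = {[0; 0], [3; 0]};
Sigma = {[1 0; 0 1], [9 0; 0 9]};
prior = [0.5, 0.5];
x     = [-1.6; 0];                       % point to classify

K = 2;
dist2 = zeros(1, K);
delta = zeros(1, K);
for k = 1:K
    [U, S, ~] = svd(Sigma{k});                   % Step 1: Sigma_k = U_k S_k U_k'
    W = diag(1 ./ sqrt(diag(S))) * U';           % Step 2: S_k^{-1/2} U_k'
    dist2(k) = sum((W * x - W * mu{k}).^2);      % squared distance after transform
    logDet   = sum(log(diag(S)));                % log|Sigma_k| from the singular values
    delta(k) = -0.5 * logDet - 0.5 * dist2(k) + log(prior(k));   % Step 3
end

[~, byDistance] = min(dist2);   % class with the smallest transformed distance
[~, byDelta]    = max(delta);   % class chosen by the QDA rule
```

For these numbers <code>byDistance</code> is 2 while <code>byDelta</code> is 1: the widely spread class 2 looks closer after its own rescaling, but its larger <math>\,\log|\Sigma_2|</math> lowers <math>\,\delta_2</math>.<br />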
<br />
<br />
Now, let us consider LDA. <br />
Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common covariance matrix across the classes, so we perform the Singular Value Decomposition once, as opposed to k times.<br />
<br />
To apply LDA on a data set <math>\,X</math>, one can proceed as follows:<br />
<br />
:: Step 1: Estimate the common covariance matrix <math>\,\Sigma</math> from <math>\,X</math> and apply singular value decomposition to it to obtain <math>\,S</math> and <math>\,U</math>.<br />
<br />
:: Step 2: For each <math>\,x \in X</math>, transform <math>\,x</math> to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, and transform each center <math>\,\mu</math> to <math>\,\mu^* = S^{-\frac{1}{2}}U^\top \mu</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu^*</math> of each class, and assign <math>\,x</math> to the class for which half of this squared distance minus the log of the class prior is the least over all of the classes. When all priors are equal, this is the class whose transformed center is nearest to <math>\,x^*</math>.<br />
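<br />
As a sketch of the LDA steps above (invented numbers, hypothetical variable names), the Matlab code below applies one common transformation and checks that the resulting rule picks the same class as the LDA discriminant <math>\,\delta_k</math> from the theorem. The two scores differ only by <math>\,-\frac{1}{2}x^\top\Sigma^{-1}x</math>, which does not depend on the class.<br />

```matlab
% One shared covariance matrix and three invented class centers.
Sigma = [2 1; 1 2];
mu    = [0 0; 3 1; 1 4];          % rows are the class centers
prior = [1/3, 1/3, 1/3];
x     = [2; 2];                   % point to classify

[U, S, ~] = svd(Sigma);                   % Step 1: a single decomposition
W = diag(1 ./ sqrt(diag(S))) * U';        % S^{-1/2} U'
xStar  = W * x;                           % Step 2: transform the point
muStar = (W * mu')';                      %         and every center (rows)

K = size(mu, 1);
score    = zeros(1, K);
deltaLDA = zeros(1, K);
for k = 1:K
    % Step 3: negative half squared distance plus log prior
    score(k) = -0.5 * sum((xStar - muStar(k, :)').^2) + log(prior(k));
    % LDA discriminant from the theorem, for comparison
    m = mu(k, :)';
    deltaLDA(k) = x' * (Sigma \ m) - 0.5 * m' * (Sigma \ m) + log(prior(k));
end

[~, classByDistance] = max(score);
[~, classByDelta]    = max(deltaLDA);
```
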
<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In actual data scenarios, QDA often provides a better classifier for the data than LDA when enough training data is available, because QDA does not assume that the covariance matrix for each class is identical, as LDA assumes. However, QDA still assumes that the class conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method which does not have the Gaussian distribution assumption.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
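<br />
To get a feeling for these counts, the short Matlab sketch below evaluates both formulas for a few dimensions. The choice of <math>\,K=3</math> classes and the values of <math>\,d</math> are arbitrary (64 is the dimension of the 2_3 data before PCA).<br />

```matlab
K = 3;                % number of classes (arbitrary choice)
d = [2 10 64];        % a few data dimensions

nLDA = (K - 1) * (d + 1);                  % (K-1)(d+1)
nQDA = (K - 1) * (d .* (d + 3) / 2 + 1);   % (K-1)(d(d+3)/2 + 1)

disp([d; nLDA; nQDA]);   % rows: d, LDA count, QDA count
```

The LDA count grows linearly in <math>\,d</math>, while the QDA count grows quadratically.<br />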
<br />
== Trick: Using LDA to do QDA - September 28, 2010==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>. Since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra <math>\,\frac{d(d+1)}{2}</math> estimates make QDA less robust when there are few data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension. Note, however, that this is not the same as doing QDA: if we apply QDA directly to this problem, the resulting decision boundary will generally be different. Here we look for a nonlinear boundary that may separate the classes better than a linear one, but the method differs from the general QDA method. We can call it nonlinear LDA.<br />
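<br />
The following Matlab sketch checks the idea on invented numbers. It assumes that <math>\,v</math> is diagonal with entries <math>\,v_1,\dots,v_d</math>, which is what the augmented vector <math>\,\underline{w}^*</math> above can represent, since <math>\,x^*</math> contains only the squares <math>\,x_i^2</math> and no cross terms <math>\,x_ix_j</math>. The variable names are hypothetical.<br />

```matlab
w = [1; -2];          % linear coefficients w_1, w_2 (invented)
v = [0.5; 3];         % diagonal quadratic coefficients v_1, v_2 (invented)
x = [2; -1];          % a point in d = 2 dimensions

% Quadratic function in the original space, with v on the diagonal.
gQuadratic = x' * diag(v) * x + w' * x;

% The same value as a linear function in the augmented space.
wStar = [w; v];       % [w_1, w_2, v_1, v_2]'
xStar = [x; x.^2];    % [x_1, x_2, x_1^2, x_2^2]'
gLinear = wStar' * xStar;
```

Both expressions evaluate to 9 here, which is why a linear method applied to <math>\,x^*</math> can produce a quadratic boundary in <math>\,x</math>.<br />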
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA to the same data set and reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the classes of the points 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA that do not lie on the correct side of the line produced by LDA.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The details of this function can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
  % returns the principal components in PC, the so-called Z-scores in SCORES,<br />
  % the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
    [m,n] = size(x);    % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
    if (r < n)<br />
       latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. This is why we take the transpose of <math>\,X</math> when using princomp on the 2_3 data in assignment 1.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using SVD and princomp respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
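Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (possibly up to the sign of individual columns, since singular vectors are only determined up to sign).<br />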
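<br />
The same relationship can be checked without the 2_3 data or <code>princomp</code>. The Matlab sketch below uses a small invented data matrix (rows are observations) and only the base functions <code>svd</code>, <code>cov</code> and <code>eig</code>. It mirrors the centering and scaling steps in the listing above, and compares the result with an eigendecomposition of the sample covariance matrix. Variable names are hypothetical.<br />

```matlab
% Invented data: 6 observations (rows) of 2 variables (columns).
X = [2 0; 0 1; -2 0; 0 -1; 1 1; -1 -1];
m = size(X, 1);

% Center the columns and scale by sqrt(m-1), as in the princomp listing.
Xc = X - repmat(mean(X, 1), m, 1);
[~, D, V] = svd(Xc / sqrt(m - 1), 0);   % economy-size SVD
latent = diag(D).^2;                    % variances along the principal components
score  = Xc * V;                        % data expressed in the principal component space

% Compare with the eigendecomposition of the sample covariance matrix.
[E, L] = eig(cov(X));
[evals, order] = sort(diag(L), 'descend');
E = E(:, order);                        % eigenvectors sorted by decreasing eigenvalue
```

The squared singular values in <code>latent</code> match the eigenvalues in <code>evals</code>, and the columns of <code>V</code> match those of <code>E</code> up to sign.<br />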
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Principal Component Analysis - September 30, 2010==<br />
===Rough definition===<br />
<br />
PCA is motivated by two important aspects of data analysis:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br />
<br /><br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis] method. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA finds a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data and projects the data onto these directions so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation of our data is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of our data whilst capturing as much of the variation in our data as possible.<br />
<br />
<br />
Furthermore, if one considers the lower dimensional representation produced by PCA as a least squares fit of our original data, then it can also be easily shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know which dimensions PCA selects to be the most informative dimensions in order to create the lower-dimensional representation. <br />
<br />
<br />
Suppose <math>\,X</math> is our data matrix containing <math>\,d</math>-dimensional data, with one data point per column. The idea behind PCA is to apply [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] to <math>\,X</math> to replace the <math>\,d</math> rows of <math>\,X</math> with a smaller number of rows, each a linear combination of the original rows, that capture as much of the [http://en.wikipedia.org/wiki/Variance variance] in <math>\,X</math> as possible. First, through the application of singular value decomposition to <math>\,X</math>, PCA obtains all of our data's directions of variation. These directions are ordered from left to right, with the leftmost directions capturing the most variation in our data and the rightmost directions capturing the least. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space.<br />
<br />
<br />
By applying singular value decomposition to <math>\,X</math>, <math>\,X</math> is decomposed as <math>\,X = U\Sigma V^T \,</math>. The <math>\,d</math> columns of <math>\,U</math> are the [http://en.wikipedia.org/wiki/Eigenvector eigenvectors] of <math>\,XX^T \,</math>.<br />
The columns of <math>\,V</math> are the eigenvectors of <math>\,X^TX \,</math>. The diagonal values of <math>\,\Sigma</math> are the square roots of the [http://en.wikipedia.org/wiki/Eigenvalue eigenvalues] shared by <math>\,XX^T \,</math> and <math>\,X^TX \,</math> (the two matrices have the same nonzero eigenvalues), and they correspond to the columns of <math>\,U</math> (and of <math>\,V</math>).<br />
<br />
<br />
We are interested in <math>\,U</math>, whose <math>\,d</math> columns are the <math>\,d</math> directions of variation of our data. Ordered from left to right, the <math>\,ith</math> column of <math>\,U</math> is the <math>\,ith</math> most informative direction of variation of our data. That is, the <math>\,ith</math> column of <math>\,U</math> is the <math>\,ith</math> most effective column in terms of capturing the total variance exhibited by our data. A subset of the columns of <math>\,U</math> is used by PCA to reduce the dimensionality of <math>\,X</math> by projecting <math>\,X</math> onto the columns of this subset. In practice, when we apply PCA to <math>\,X</math> to reduce the dimensionality of <math>\,X</math> from <math>\,d</math> to <math>\,k</math>, where <math>k < d\,</math>, we would proceed as follows:<br />
<br />
:: Step 1: Center <math>\,X</math> so that it would have zero mean.<br />
<br />
:: Step 2: Apply singular value decomposition to <math>\,X</math> to obtain <math>\,U</math>.<br />
<br />
:: Step 3: Suppose we denote the resulting <math>\,k</math>-dimensional representation of <math>\,X</math> by <math>\,Y</math>. Then, <math>\,Y</math> is obtained as <math>\,Y = U_k^TX</math>. Here, <math>\,U_k</math> consists of the first (leftmost) <math>\,k</math> columns of <math>\,U</math> that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.<br />
<br />
<br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' principal components. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), and so on.<br />
<br />
We can then preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
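<br />
The three steps above can be sketched for a toy case with ''d'' = 2 and ''k'' = 1. This is only an illustration under stated assumptions: the data and the helper names (<code>center_rows</code>, <code>first_direction</code>, <code>project</code>) are hypothetical, and power iteration on <math>\,XX^T</math> is swapped in for the singular value decomposition so that the sketch needs no external library. Its limit is the first column of <math>\,U</math>, up to sign, provided the starting vector is not orthogonal to that column.<br />

```python
# Sketch of Steps 1-3 for d = 2, k = 1 (standard library only).
# Assumption: power iteration on X X^T replaces the SVD used in the text.

def center_rows(X):
    """Step 1: subtract each row's mean so every feature has zero mean."""
    centered = []
    for row in X:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered

def first_direction(X, iterations=200):
    """Step 2 (k = 1): leading eigenvector of X X^T by power iteration."""
    d = len(X)
    # C = X X^T is a d x d matrix.
    C = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(d)]
         for i in range(d)]
    u = [1.0] * d
    for _ in range(iterations):
        u = [sum(C[i][j] * u[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in u) ** 0.5
        u = [v / norm for v in u]
    return u

def project(u, X):
    """Step 3 (k = 1): Y = u^T X, one coordinate per data point."""
    n = len(X[0])
    return [sum(u[i] * X[i][j] for i in range(len(u))) for j in range(n)]

# Toy data: five points lying exactly on the line x2 = 2 * x1 (columns are points).
X = [[1.0, 2.0, 3.0, 4.0, 5.0],
     [2.0, 4.0, 6.0, 8.0, 10.0]]
Xc = center_rows(X)
u1 = first_direction(Xc)   # proportional to (1, 2)
Y = project(u1, Xc)        # 1-dimensional representation of the five points
```

Because the toy points lie on a single line, one coordinate per point is enough to describe them exactly; with real data the discarded directions would carry the remaining variance.<br />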
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes. <br />
{{Cleanup|date=September 6 2010|reason=If anyone can tell me where I can find the 2-3 data set, I would create the new image. In the mean time, I found a non-copyrighted image of different looking 3s online, but as you can see, it is not as nice as one we could make.}}<br />
{{Cleanup|date=September 6 2010|reason=I think you can find it on your UW-ACE account for this course.}}<br />
<br />
[[File:Handwritten 3s.gif]]<br />
<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which we will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,130}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s: we apply PCA to the data and compare the projections of the labeled images. This is an example of classification.<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[Image:23plotPCA.jpg]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
{{Cleanup|date=October 2010|reason=I think English of this section must be improved}}<br />
We want to find the direction of maximum variation. Let <math>\begin{align}\textbf{w}\end{align}</math> be an arbitrary direction, <math>\begin{align}\textbf{x}\end{align}</math> a data point and <math>\begin{align}\displaystyle u\end{align}</math> the length of the projection of <math>\begin{align}\textbf{x}\end{align}</math> in direction <math>\begin{align}\textbf{w}\end{align}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\begin{align}\textbf{w}\end{align}</math> is the same as <math>\begin{align}c\textbf{w}\end{align}</math>, for any nonzero scalar <math>c</math>, so without loss of generality, we assume that: <br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}.<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables. Our goal is then to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}. <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} .</math><br />
<br /><br /><br />
The above is the variance of <math>\begin{align}\displaystyle u \end{align}</math> formed by the weight vector <math>\begin{align}\textbf{w} \end{align}</math>. The first principal component is the vector <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the function. Without a constraint this problem has no solution: <math>\textbf{w}^T s \textbf{w}</math> is a convex quadratic function of <math>\begin{align}\textbf{w} \end{align}</math> (since <math>s</math> is positive semidefinite), so it grows without bound as <math>\begin{align}\textbf{w} \end{align}</math> is scaled up, unless the variance in that direction is zero. Since we are only interested in the direction of greatest variability, we restrict the length of <math>\begin{align}\textbf{w} \end{align}</math>, and the problem becomes<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
s.t. <math>\textbf{w}^T \textbf{w} = 1.</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|.<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded on the constraint set. Because the objective is continuous and the unit sphere is closed and bounded, the maximum is attained. We find this maximum using the method of Lagrange multipliers.<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is a maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, then (under mild regularity conditions) there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> touches the constraint curve <math>\displaystyle g(x,y) = 0</math> but does not cross it. At this point, the contour of <math>\displaystyle f</math> and the constraint curve have parallel tangents, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one gives the bigger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
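<br />
As a rough numerical check of this example, the sketch below scans the unit circle for the largest value of <math>\displaystyle f(x,y)=x-y</math> and verifies that the first stationary point makes the partial derivatives of the Lagrangian <math>\displaystyle L = x-y-\lambda(x^2+y^2-1)</math> vanish. The grid size and variable names are arbitrary choices made for illustration.<br />

```python
import math

def f(x, y):
    """Objective from the example."""
    return x - y

# Scan the constraint x^2 + y^2 = 1, parametrized by the angle t.
steps = 100000
best_t = max((2 * math.pi * k / steps for k in range(steps)),
             key=lambda t: f(math.cos(t), math.sin(t)))
x_best, y_best = math.cos(best_t), math.sin(best_t)

# Candidate from the Lagrange conditions, with lambda = 1 / (2 x).
x_star, y_star = math.sqrt(2) / 2, -math.sqrt(2) / 2
lam = 1 / (2 * x_star)

# Residuals of the stationarity conditions of L = x - y - lambda (x^2 + y^2 - 1).
dL_dx = 1 - 2 * lam * x_star
dL_dy = -1 - 2 * lam * y_star
```

The grid maximizer <code>(x_best, y_best)</code> agrees with <code>(x_star, y_star)</code> to within the grid resolution, and both residuals are zero up to rounding.<br />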
<br />
====Determining '''W''' ====<br />
Returning to the original problem, we form the Lagrangian (here '''S''' denotes the sample covariance matrix, written <math>s</math> above),<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
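If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} = 1 </math> and the second term of the Lagrangian is 0, so <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math> on the constraint set.<br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies it.<br />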
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''' with symmetric '''S''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
{{Cleanup|date=October 2010|reason=It is good discussion, what will happen if we don't have distinct eigenvalues and eigenvectors? What does this situation mean? }}<br />
{{Cleanup|date=October 2010|reason=If the eigenvalues are not distinct, I suppose we could still take the leftmost eigenvector by default. Not sure if this is the correct approach, so can anyone please explain further? Thanks }}<br />
{{Cleanup|date=October 2010|reason= Since the columns of U are eigenvectors of a symmetric matrix, is it possible to have two similar eigenvectors? }}<br />
<br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, whose eigenvalues can be ordered as<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br />
i.e. the sum of the variances along all principal directions equals the total variance of the data.<br />
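<br />
Both facts, that the largest projected variance equals the largest eigenvalue and that the eigenvalues sum to the trace, can be checked numerically in two dimensions. The sketch below assumes a hypothetical 2 × 2 sample covariance matrix with entries <code>a</code>, <code>b</code>, <code>c</code>; the closed-form eigenvalue formula used is specific to symmetric 2 × 2 matrices.<br />

```python
import math

# Hypothetical 2 x 2 sample covariance matrix S = [[a, b], [b, c]].
a, b, c = 3.0, 1.0, 2.0

# Closed-form eigenvalues of a symmetric 2 x 2 matrix.
mid = (a + c) / 2
rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mid + rad, mid - rad

def variance_along(t):
    """w^T S w for the unit vector w = (cos t, sin t)."""
    w1, w2 = math.cos(t), math.sin(t)
    return a * w1 * w1 + 2 * b * w1 * w2 + c * w2 * w2

# Largest projected variance over a fine grid of unit directions.
# Angles in [0, pi) suffice because w and -w give the same variance.
steps = 100000
max_var = max(variance_along(math.pi * k / steps) for k in range(steps))

trace = a + c
```

Here <code>max_var</code> matches <code>lam1</code> up to the grid resolution, and <code>lam1 + lam2</code> equals <code>trace</code>.<br />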
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                  % loads the data matrix X (one image per column)<br />
 who<br />
 size(X)<br />
 imagesc(reshape(X(:,1),20,28)')             % display the first noisy image<br />
 colormap gray<br />
 m_X=mean(X,2);                              % mean image<br />
 mm=repmat(m_X,1,300);<br />
 XX=X-mm;                                    % center the data<br />
 [u s v] = svd(XX);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';  % use ten principal components<br />
 xHat=xHat+mm;                               % add the mean back<br />
 figure<br />
 imagesc(reshape(xHat(:,1),20,28)')          % reconstruction of the first image; the index can be any column from 1 to 300<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Extraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this. We can classify the data (i.e. use labeled data to assign each new data point to a class) or cluster the data (i.e. group unlabeled data points by similarity).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 × 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to work with a reduced set of features extracted from the data.<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized as follows (taken from the Lecture Slides).<br />
<br />
====Algorithm ====<br />
'''Recover basis:''' Calculate <math> XX^T =\sum_{i=1}^{n} x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.<br />
<br />
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times n</math> matrix of encoding of the original data.<br />
<br />
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.<br />
<br />
'''Encode test example:''' <math> y=U^T x </math> where <math> y </math> is a <math>d-</math>dimensional encoding of <math>x</math>.<br />
<br />
'''Reconstruct test example:''' <math>\hat{x}= Uy=UU^Tx </math>.<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.<br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
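<br />
The encoding and reconstruction steps can be illustrated with a minimal sketch for <math>D = 2</math> and <math>d = 1</math>. The basis vector <code>u</code> and the two points are hypothetical values chosen so that the arithmetic is easy to follow; in practice <code>u</code> would come from the "Recover basis" step and the data would be centered first.<br />

```python
# Hypothetical example: D = 2, d = 1, with a known unit basis vector u.
u = [0.6, 0.8]   # the single column of U; note 0.6^2 + 0.8^2 = 1

def encode(x):
    """y = U^T x (a scalar here, since d = 1)."""
    return u[0] * x[0] + u[1] * x[1]

def reconstruct(y):
    """x_hat = U y."""
    return [u[0] * y, u[1] * y]

x = [3.0, 4.0]    # lies along u, so it is reconstructed exactly
z = [4.0, -3.0]   # orthogonal to u, so its reconstruction is the zero vector

x_hat = reconstruct(encode(x))
z_hat = reconstruct(encode(z))
```

The point <code>x</code> survives the round trip unchanged, while <code>z</code> is mapped to the origin: whatever lies outside the retained subspace is discarded, which is how low-variance directions such as noise are removed.<br />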
<br />
== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010 ==<br />
<br />
===Sir Ronald A. Fisher===<br />
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here]. <br />
In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
As in PCA, the goal of FDA is to project the data onto a lower-dimensional space. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf here]. PCA is an unsupervised method, so it does not take into account the labels in the data. Suppose we have two clusters that have different labels but are nevertheless elongated, parallel to each other, and very close to each other. In this case, most of the total variation of the data is along the direction in which the two clusters are elongated. If we use PCA in cases like this, then both clusters would be projected onto the direction of greatest variation of the data and would look like a single cluster after projection. PCA would therefore mix up these two clusters that, in fact, have different labels. What we need to do instead, in cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation of the data, which is the direction of least variation. On the 1-dimensional space resulting from such a projection, we would then be able to effectively classify the data, because the two clusters would be perfectly or nearly perfectly separated from each other according to their labels. This is exactly the idea behind FDA.<br />
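<br />
The situation described above can be reproduced with a small deterministic sketch. The two clusters, their labels, and the helper names are made up for illustration; the direction of greatest variance is written down by inspection rather than computed, since all of the spread in this toy data set lies along the first coordinate.<br />

```python
# Two parallel, elongated clusters with different labels (deterministic toy data).
class_a = [(float(t), 0.0) for t in range(-5, 6)]   # label A, on the line x2 = 0
class_b = [(float(t), 1.0) for t in range(-5, 6)]   # label B, on the line x2 = 1

def project(points, w):
    """Scalar projections w^T x of each point onto the direction w."""
    return [w[0] * p[0] + w[1] * p[1] for p in points]

def ranges_overlap(u, v):
    """True if the intervals spanned by two lists of projections intersect."""
    return max(min(u), min(v)) <= min(max(u), max(v))

greatest_variance = (1.0, 0.0)   # along the clusters (what PCA would pick here)
orthogonal = (0.0, 1.0)          # across the clusters

# Projecting onto the direction of greatest variance mixes the two classes.
mixed = ranges_overlap(project(class_a, greatest_variance),
                       project(class_b, greatest_variance))

# Projecting onto the orthogonal direction keeps them apart.
separated = not ranges_overlap(project(class_a, orthogonal),
                               project(class_b, orthogonal))
```

Both <code>mixed</code> and <code>separated</code> evaluate to <code>True</code>: along the high-variance direction the two classes occupy the same interval, while along the orthogonal direction each class collapses to its own point.<br />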
<br />
The main difference between FDA and PCA is that, in FDA, in contrast to PCA, we are not interested in retaining as much of the variance of our original data as possible. Rather, in FDA, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction that is most representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
Suppose we have 2-dimensional data. FDA would then attempt to project the data of each class onto a point in such a way that the resulting two points would be as far apart from each other as possible. Intuitively, this is the ideal way of separating a pair of classes along a single direction.<br />
<br />
{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).}}<br />
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}<br />
{{Cleanup|date=October 2010|reason=An extension of clustering is subspace clustering, in which different subspaces are searched to find the relevant and appropriate dimensions. Points in high-dimensional data sets are roughly equidistant from each other, so feature selection methods are used to remove the irrelevant dimensions. These techniques do not keep the relative distances, so PCA is not useful for these applications. It should be noted that subspace clustering algorithms localize their search, unlike feature selection algorithms. For more information, see [http://portal.acm.org/citation.cfm?id=1007731]}}<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a two-class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k-class problem, we want to reduce the data to k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
{{Cleanup|date=October2010|reason=Anyone please add an example to make the comparison clearer}}<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs better than or as well as some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
===FDA Goals===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
==== Example in R ====<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
 >> library(MASS)  # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
 >> s <- svd(scale(X,scale=FALSE),nu=1,nv=1)<br />
: Calculate the singular value decomposition of the centered X (PCA requires zero-mean data). The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
<br />
FDA projects the data into lower dimensional space, where the distances between the projected means are maximum and the within-class variances are minimum. There are two categories of classification problems:<br />
<br />
1. Two-class problem<br />
<br />
2. Multi-class problem (addressed next lecture)<br />
<br />
=== Two-class problem ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and the clouds may differ in size. To separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different spread of each class, which is represented by its covariance.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within each class''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances (the sum of the two covariance matrices, <math>\Sigma_{1}+\Sigma_{2}</math>, is itself a valid covariance matrix, since it is symmetric and positive semi-definite). <br />
<br />
{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}<br />
<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> where <math>\ z_i </math> is a scalar<br />
<br />
====1. Minimizing within-class variance==== <br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_1\underline{w}) </math><br />
<br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_2\underline{w}) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math><br />
<br> (where <math>\,\Sigma_1</math> and <math>\,\Sigma_2 </math> are the covariance matrices of the 1st and 2nd classes of data, respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>.<br />
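The claim that the variance of a projected class is <math>\underline{w}^T\Sigma\underline{w}</math> can be checked numerically. The following is a minimal Python sketch; the points and the vector <code>w</code> are made-up illustrative values (not from the lecture), and the covariance is assumed to use the <math>1/n</math> normalisation.

```python
# Check numerically that the variance of the projected points z_i = w^T x_i
# equals w^T Sigma w. Points and w are illustrative; 1/n normalisation is assumed.

points = [[1.0, 2.0], [2.0, 0.0], [3.0, 4.0], [6.0, 2.0]]
w = [0.5, -1.0]
n = len(points)
dim = len(w)

# Class mean and covariance matrix Sigma
mu = [sum(p[d] for p in points) / n for d in range(dim)]
sigma = [[sum((p[r] - mu[r]) * (p[c] - mu[c]) for p in points) / n
          for c in range(dim)] for r in range(dim)]

# Variance of the one-dimensional projections
z = [sum(wd * pd for wd, pd in zip(w, p)) for p in points]
z_mean = sum(z) / n
var_z = sum((v - z_mean) ** 2 for v in z) / n

# Quadratic form w^T Sigma w
quad = sum(w[r] * sigma[r][c] * w[c] for r in range(dim) for c in range(dim))

print(var_z, quad)
```

Both printed numbers agree, which is why minimizing the projected variance can be written as minimizing a quadratic form in <math>\underline{w}</math>.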
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br /> <br />
<math>\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2, </math><br />
<br /><br /><br />
<math>\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1 - \mu_2)^T \underline{w} \underline{w}^T (\mu_1 - \mu_2) \\<br />
&= ((\mu_1 - \mu_2)^T \underline{w})^{T} (\underline{w}^T (\mu_1 - \mu_2))^{T} \\<br />
&= \underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \underline{w} \end{align}</math><br /><br />
<br />
Note that in the last two lines the factors can be transposed and reordered because each of them, <math>(\mu_1 - \mu_2)^T \underline{w}</math> and <math>\underline{w}^T (\mu_1 - \mu_2)</math>, is a scalar and therefore equals its own transpose.<br />
<br />
Let <math>\displaystyle s_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math>, the between-class covariance, then the goal is to <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>.<br />
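As a quick numerical check of the identity derived above, the following Python sketch evaluates both sides for one projection vector. The means match the example used earlier in these notes; the vector <code>w</code> is an arbitrary illustrative choice.

```python
# Check that (w^T mu1 - w^T mu2)^2 equals w^T s_B w,
# where s_B = (mu1 - mu2)(mu1 - mu2)^T. The vector w is illustrative only.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def outer(a, b):
    return [[x * y for y in b] for x in a]

def quad_form(w, m):
    # w^T m w for a square matrix m given as a list of rows
    return dot(w, [dot(row, w) for row in m])

mu1 = [1.0, 1.0]
mu2 = [5.0, 3.0]
w = [0.6, -0.8]

diff = [a - b for a, b in zip(mu1, mu2)]
s_B = outer(diff, diff)

lhs = (dot(w, mu1) - dot(w, mu2)) ** 2   # squared distance between projected means
rhs = quad_form(w, s_B)                  # quadratic form in s_B
print(lhs, rhs)
```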
<br />
===The Objective Function for FDA===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math> or <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math><br />
# <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math> <br />
<br /><br /><br />
So, we construct our objective function as maximizing the ratio of the two goals brought above:<br /><br />
<br /><br />
<math>\displaystyle \max_w \frac{(\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w})} {(\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max_w \frac{(\underline{w}^Ts_B\underline{w})}{(\underline{w}^Ts_w\underline{w})}</math> <br /><br />
One could instead combine the two goals by subtraction. That approach is valid, but it can be shown that it requires an additional scaling factor to balance the two terms, so using the ratio is more convenient.<br />
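The ratio above is easy to evaluate directly. The sketch below is a minimal Python illustration using the means and covariance of the example earlier in these notes; the helper names <code>quad_form</code> and <code>fisher_ratio</code> are hypothetical and introduced only for this example.

```python
# Fisher ratio J(w) = (w^T s_B w) / (w^T s_w w) for a 2-D two-class problem.
# Means and covariance follow the example used earlier in these notes.

def quad_form(w, m):
    """Return w^T m w for a square matrix m given as a list of rows."""
    return sum(w[r] * m[r][c] * w[c] for r in range(len(w)) for c in range(len(w)))

def fisher_ratio(w, s_B, s_w):
    return quad_form(w, s_B) / quad_form(w, s_w)

sigma = [[1.0, 1.5], [1.5, 3.0]]
s_w = [[2 * v for v in row] for row in sigma]      # Sigma_1 + Sigma_2 with equal covariances
diff = [1.0 - 5.0, 1.0 - 3.0]                      # mu_1 - mu_2
s_B = [[a * b for b in diff] for a in diff]        # (mu_1 - mu_2)(mu_1 - mu_2)^T

w = [1.0, 0.0]
j_w = fisher_ratio(w, s_B, s_w)
j_scaled = fisher_ratio([3.0 * v for v in w], s_B, s_w)   # same direction, different length
j_other = fisher_ratio([0.0, 1.0], s_B, s_w)              # a different direction

print(j_w, j_scaled, j_other)
```

The first two values coincide, since only the direction of <math>\underline{w}</math> affects the ratio, while the third direction gives a smaller value.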
<br />
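The value of this ratio does not change when <math>\underline{w}</math> is rescaled, so the maximizer is determined only up to a scalar multiple. To fix the scale, we add the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> and maximize <math>\underline{w}^Ts_B\underline{w}</math> subject to it. To solve this constrained optimization problem we form the Lagrangian:<br />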
<br />
<br /><br /><br />
<math>\displaystyle L(\underline{w},\lambda) = \underline{w}^Ts_B\underline{w} - \lambda (\underline{w}^Ts_w\underline{w} -1)</math><br /><br /><br />
<br />
<br /><br />
Then, we set the partial derivative of L with respect to <math>\underline{w}</math> to zero:<br />
<math>\displaystyle \frac{\partial L}{\partial \underline{w}}=2s_B \underline{w} - 2\lambda s_w \underline{w} = 0 </math> <br /><br />
<br />
<math>s_B \underline{w} = \lambda s_w \underline{w}</math><br /><br />
<math>s_w^{-1}s_B \underline{w}= \lambda\underline{w}</math><br /><br /><br />
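This is a generalized eigenvalue problem. At any solution the objective equals <math>\underline{w}^Ts_B\underline{w} = \lambda \underline{w}^Ts_w\underline{w} = \lambda</math>, so <math> \underline{w}</math> is the eigenvector of <math>s_w^{-1}s_B </math> corresponding to the largest eigenvalue.<br /><br />
<br />
This solution can be further simplified as follows:<br /><br />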
<br />
<math>s_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w} = \lambda\underline{w} </math><br /><br />
<br />
Since <math>(\mu_1 - \mu_2)^T\underline{w}</math> is a scalar, it follows that <math>\underline{w}</math> ∝ <math>s_w^{-1}(\mu_1 - \mu_2)</math>. <br /><br /><br />
This gives the direction of <math>\underline{w}</math> without doing eigenvalue decomposition in the case of 2-class problem.<br />
<br />
Note: In order for <math>{s_w}</math> to have an inverse, it must have full rank. A necessary (but not sufficient) condition for this is that the number of data points is <math>\,\ge</math> the dimensionality of <math>\underline{x_{i}}</math>.<br />
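The closed-form direction can be computed in a few lines. The following Python sketch assumes the two-dimensional setting of the example in these notes (both classes share the covariance <math>\left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>); the helpers <code>inv2</code> and <code>matvec</code> are hypothetical names written only for this illustration.

```python
# Two-class Fisher direction w = s_w^{-1} (mu1 - mu2) in two dimensions,
# followed by a check of the eigen-equation s_w^{-1} s_B w = lambda w.

def inv2(m):
    """Inverse of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("s_w is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma1 = [[1.0, 1.5], [1.5, 3.0]]
sigma2 = [[1.0, 1.5], [1.5, 3.0]]

s_w = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(sigma1, sigma2)]
diff = [a - b for a, b in zip(mu1, mu2)]
w = matvec(inv2(s_w), diff)                      # direction, defined up to scale

s_B = [[a * b for b in diff] for a in diff]      # (mu1 - mu2)(mu1 - mu2)^T
lhs = matvec(inv2(s_w), matvec(s_B, w))          # s_w^{-1} s_B w
lam = sum(d * x for d, x in zip(diff, w))        # eigenvalue (mu1 - mu2)^T w

print(w, lam, lhs)
```

Up to sign, <code>w</code> is the same vector that the Matlab line <code>w=sw\[4; 2]</code> below computes from <math>\mu_2 - \mu_1</math>; the sign does not matter because only the direction is used.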
<br />
===FDA Using Matlab===<br />
Note: ''The following example was not actually mentioned in this lecture''<br />
<br />
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data set:<br />
% First data set X1<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);<br />
%In this case: <br />
mu_1=[1;1]; <br />
Sigma_1=[1 1.5; 1.5 3]; <br />
%where mu and sigma are the mean and covariance matrix.<br />
% Second data set X2<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300); <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
plot(X1(:,1),X1(:,2),'.b'); hold on;<br />
plot(X2(:,1),X2(:,2),'ob')<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
% Combine data sets to map both into the same subspace<br />
X=[X1;X2];<br />
X=X';<br />
 % We use the built-in PCA function in Matlab; princomp expects one observation per row, so we pass X'<br />
 [coefs, scores]=princomp(X');<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is very little overlap<br />
 Xp=coefs(:,1)'*X<br />
 figure<br />
 plot(Xp(1:300),1,'ob')<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is an overlap, since we project onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
===Some of FDA applications===<br />
FDA has applications in many domains; some of them are listed below:<br />
<br />
* Speech/music/noise classification in hearing aids<br />
FDA can be used to enhance listening comprehension when the user moves from one sound<br />
environment to another. For more information, see the paper by Alexandre et al. [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here].<br />
<br />
* Application to face recognition<br />
FDA can be used for face recognition in different situations. Kong et al. propose an FDA-based approach to face<br />
recognition with a small number of training samples [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].<br />
<br />
* Palmprint Recognition<br />
FDA is used in biometrics, to implement an automated palmprint recognition system. See An Automated Palmprint Recognition System by Tee et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here].<br />
<br />
{{Cleanup|date=October 2010|reason=I think briefing about the other applications would be easier than browsing through all of these applications}}<br />
<br />
{{Cleanup|date=October 2010|reason= This link is no longer valid.}}<br />
<br />
Other applications can be found in references 4, 5, 6, 7 and 8, and more [http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=1489148820&_sort=r&_st=13&view=c&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=f210273546a659c90ae0962fce7b8b4e&searchtype=a here].<br />
<br />
=== '''References'''===<br />
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; , "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , vol.2, no., pp. 1083- 1088 vol. 2, 20-25 June 2005<br />
doi: 10.1109/CVPR.2005.30<br />
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]<br />
<br />
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "Speech/music/noise classification in hearing aids using a two-layer classification system with MSE linear discriminants", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]<br />
<br />
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]<br />
<br />
4. met, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.<br />
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]<br />
<br />
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"<br />
Journal of Computers & Chemical Engineering, 2004<br />
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]<br />
<br />
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004<br />
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]<br />
<br />
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]<br />
<br />
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal of Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]<br />
<br />
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010==<br />
<br />
====Obtaining Covariance Matrices====<br />
<br />
<br />
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{i}<br />
\end{align}<br />
</math><br />
<br />
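where <math>\mathbf{S}_{i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math> is its mean. The factor <math>\frac{1}{n_{i}}</math> is omitted so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> used below holds exactly.<br />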
<br />
However, the between-class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total covariance <math>\,\mathbf{S}_{T}</math> of a given set of data is constant and known, and we can also decompose this variance into two parts: the within-class variance <math>\mathbf{S}_{W}</math> and the between-class variance <math>\mathbf{S}_{B}</math> in a way that is similar to [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA]. We thus have:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
where the total variance is given by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = <br />
\sum_{i}(\mathbf{x}_{i}-\mathbf{\mu})(\mathbf{x}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
We can now get <math>\mathbf{S}_{B}</math> from the relationship: <br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
<br />
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Suppose the data contains <math>\, k </math> classes, and each class <math>\, j </math> contains <math>\, n_{j} </math> data points. We denote the overall mean vector by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x}_{i}-\mathbf{\mu})(\mathbf{x}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
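
The step from the first to the second line of this derivation uses the fact that the cross terms vanish. Written out for a single class <math>i</math>, and using <math>\sum_{j: y_{j}=i}\mathbf{x}_{j} = n_{i}\mathbf{\mu}_{i}</math>:

:<math>
\begin{align}
\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
= \left[\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})\right](\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
= (n_{i}\mathbf{\mu}_{i}-n_{i}\mathbf{\mu}_{i})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}
= \mathbf{0}
\end{align}
</math>

The transposed cross term vanishes for the same reason, so only the two squared terms remain.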
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within-class covariance <math>\mathbf{S}_{W}</math><br />
and the between-class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term in the final line of the derivation above as the between-class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
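
The decomposition can be verified on a small labelled data set. The following Python sketch uses made-up points in three classes (illustrative values, not course data) and the unnormalized scatter form of the derivation above; the helper names are hypothetical.

```python
# Verify S_T = S_W + S_B on a tiny labelled data set (illustrative values).

def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def scatter(points, center):
    """Sum over the points of (x - center)(x - center)^T."""
    dim = len(center)
    s = [[0.0] * dim for _ in range(dim)]
    for p in points:
        diff = [p[d] - center[d] for d in range(dim)]
        for r in range(dim):
            for c in range(dim):
                s[r][c] += diff[r] * diff[c]
    return s

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

classes = {
    1: [[0.0, 0.0], [2.0, 1.0], [1.0, 2.0]],
    2: [[5.0, 5.0], [6.0, 7.0]],
    3: [[9.0, 0.0], [8.0, 2.0], [10.0, 1.0], [9.0, 1.0]],
}

all_points = [p for pts in classes.values() for p in pts]
mu = mean(all_points)

dim = len(mu)
S_W = [[0.0] * dim for _ in range(dim)]
S_B = [[0.0] * dim for _ in range(dim)]
for pts in classes.values():
    mu_i = mean(pts)
    S_W = add(S_W, scatter(pts, mu_i))
    # n_i (mu_i - mu)(mu_i - mu)^T is the scatter of n_i copies of mu_i about mu
    S_B = add(S_B, scatter([mu_i] * len(pts), mu))

S_T = scatter(all_points, mu)
S_sum = add(S_W, S_B)
print(S_T)
print(S_sum)
```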
<br />
<br />
Recall that in the two-class problem we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^* =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})^{T})<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
These two expressions are proportional to each other, so they define the same discriminant direction.<br />
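
To relate the two-class matrix <math>\mathbf{S}_{B}^*</math> to the general <math>\mathbf{S}_{B}</math> explicitly, substitute the overall mean <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math> with <math>n = n_{1}+n_{2}</math>:

:<math>
\begin{align}
& \mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2}), \qquad
\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})
\\ & \mathbf{S}_{B} = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}
= \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B}^*
\end{align}
</math>

The scalar factor <math>\frac{n_{1}n_{2}}{n}</math> rescales the eigenvalues but leaves the eigenvectors, and hence the projection direction, unchanged.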
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))((\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W})<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the following as our measure:<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
{{Cleanup|date=What if we encounter complex eigenvalues? Then the concept of being large does not make sense. What is the solution in that case? }}<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
<br />
Recall that the squared Frobenius norm of <math>\mathbf{X}</math> is <br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following classic criterion function that Fisher used<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem we impose the normalization column by column, <math>\mathbf{w}_{i}^{T}\mathbf{S}_{W}\mathbf{w}_{i}=1</math> (the scale of a column does not affect the direction it defines), and introduce one Lagrange multiplier <math>\lambda_{i}</math> per column of <math>\mathbf{W}</math>. The multipliers are collected in a <math>(k-1) \times (k-1)</math> diagonal matrix <math>\Lambda</math>:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - \mathbf{I} \right)\right]<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices. Setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \mathbf{S}_{W}\mathbf{W}\Lambda=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \mathbf{S}_{W}\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>. Read column by column, this is <math>\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{S}_{W}\mathbf{w}_{i}</math>.<br />
<br />
As a matter of fact, <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>.<br />
<br />
Therefore, the solution for this question is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
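
For a small problem the eigenvectors can be computed by hand. The following Python sketch solves the eigenproblem in two dimensions; the matrices <code>S_W</code> and <code>S_B</code> are hypothetical symmetric scatter matrices chosen only for illustration, and the closed-form 2x2 eigen-solution stands in for a general eigenvalue routine. When <math>\mathbf{S}_{W}</math> is positive definite, <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> is similar to the symmetric positive semi-definite matrix <math>\mathbf{S}_{W}^{-1/2}\mathbf{S}_{B}\mathbf{S}_{W}^{-1/2}</math>, so its eigenvalues are real and non-negative and can be ordered by size.

```python
import math

# Solve S_W^{-1} S_B w = lambda w in two dimensions without external libraries.
# S_W and S_B are hypothetical scatter matrices chosen for illustration.

def inv2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(p, q):
    return [[sum(p[r][k] * q[k][c] for k in range(2)) for c in range(2)] for r in range(2)]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

S_W = [[4.0, 1.0], [1.0, 3.0]]
S_B = [[10.0, 4.0], [4.0, 6.0]]

M = matmul2(inv2(S_W), S_B)
(a, b), (c, d) = M

# Eigenvalues from the characteristic polynomial lambda^2 - tr*lambda + det = 0
tr, det = a + d, a * d - b * c
root = math.sqrt(tr * tr - 4.0 * det)
eigenvalues = [(tr + root) / 2.0, (tr - root) / 2.0]   # largest first

# For a 2x2 matrix with b != 0, (b, lambda - a) is an eigenvector for lambda
W_columns = [[b, lam - a] for lam in eigenvalues]

for lam, w in zip(eigenvalues, W_columns):
    print(lam, w, matvec(S_B, w), [lam * v for v in matvec(S_W, w)])
```

Each printed row shows that <math>\mathbf{S}_{B}\mathbf{w}_{i}</math> and <math>\lambda_{i}\mathbf{S}_{W}\mathbf{w}_{i}</math> agree. In higher dimensions a numerical eigenvalue routine would be used instead of the closed form.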
<br />
{{Cleanup|date=October 2010|reason=Adding more general comments about the advantages and flaws of FDA would be effective here.}}<br />
<br />
{{Cleanup|date=October 2010|reason=Would you please show how could we reconstruct our original data from the data that its dimentionality is reduced by FDA.}}<br />
{{Cleanup|date=October 2010|reason= When you reduce the dimensionality of data in most general form you lose some features of the data and you cannot reconstruct the data from redacted space unless the data have special features that help you in reconstruction like sparsity. In FDA it seems that we cannot reconstruct data in general form using reducted version of data }}<br />
<br />
===Generalization of Fisher's Linear Discriminant Analysis ===<br />
<br />
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity<br />
and the fact that it does not require strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be badly affected by even a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Therefore, there is a need for robust procedures that can accommodate outliers and are not strongly affected by them. A generalization of Fisher's linear discriminant algorithm [http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf] has been developed that leads to such a robust procedure.<br />
<br />
Also notice that LDA can be seen as a dimensionality reduction technique. In general k-class problems, we have k means which lie on a linear subspace with dimension k-1. Given a data point, we are looking for the closest class mean to this point. In LDA, we project the data point to the linear subspace and calculate distances within that subspace. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimensionality from d dimensions to k - 1 dimensions.<br />
<br />
==Linear and Logistic Regression - October 12, 2010==<br />
<br />
===Linear Regression===<br />
Linear regression is an approach for modeling a scalar response <math>\, y</math> as a function of a set of explanatory (independent) variables <math>\,X</math>. In linear regression the goal is to estimate the value of <math>\, y</math> from the values of these related variables. In classification, by contrast, the goal is to assign data to different groups such that members of the same group are more similar to each other than to members of other groups.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
According to Bayes Classification we estimate the posterior as,<br/><br />
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{l}f_{l}(x)\pi_{l}}</math><br/><br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The simple linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
y_i = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
and we can denote it as<br />
:<math><br />
\begin{align}<br />
\mathbf{y} = \beta^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
where <math>\,\beta^{T} = (<br />
\beta_1,..., \beta_{d},\beta_0)</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=<br />
\begin{pmatrix}<br />
\mathbf{x}_{1}, \dots,\mathbf{x}_{n}\\<br />
1, \dots, 1<br />
\end{pmatrix}<br />
</math> is a <math>(d+1) \times n</math> matrix, here <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math> our goal is to find <math>\,\beta^{T}</math> such that the linear model fits the data while minimizing sum of squared errors using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
We then try to minimize the residual sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\beta^{T}\mathbf{X})(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}(\mathbf{y}^{T}-\mathbf{X}^{T}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}^{T}<br />
\end{align}<br />
</math><br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \hat\beta^{T}\mathbf{X} = <br />
\mathbf{y}\mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X} </math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].<br />
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{l}f_{l}(x)\pi_{l}}</math><br/><br />
To make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If it is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then nothing restricts <math>\displaystyle r(x)</math> to values between 0 and 1. On the other hand, this is a more direct approach to classification, since it does not need to estimate <math>\ f_k(x) </math> and <math>\ \pi_k </math>: for a 0/1 response,<br />
<math>\ E(Y|X=x)=1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x)=P(Y=1|X=x) </math>. <br />
Because the fitted values are not confined to the interval from 0 to 1, this is not a good model of the posterior, but at times it can still lead to a decent classifier. One response coding used in that case is <math>\ y_i=\frac{1}{n_1} </math> for one class and <math>\ y_i=\frac{-1}{n_2} </math> for the other.<br />
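<br />
The least-squares solution and the caveat in the note above can be illustrated with a small numerical sketch. It assumes a single feature (<math>d=1</math>), so that <math>\mathbf{X}\mathbf{X}^{T}</math> is a <math>2 \times 2</math> matrix that can be inverted by hand; the function name <code>fit_least_squares</code> and the toy data are hypothetical and not part of the lecture.<br />

```python
# Least squares for a single feature (d = 1), using the convention above:
# X is (d+1) x n with a final row of ones, so beta = (beta_1, beta_0).

def fit_least_squares(xs, ys):
    """Solve (X X^T) beta = X y^T in closed form for d = 1."""
    n = len(xs)
    sxx = sum(x * x for x in xs)               # X X^T = [[sxx, sx], [sx, n]]
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))   # X y^T = [sxy, sy]
    sy = sum(ys)
    det = sxx * n - sx * sx                    # zero only if all xs coincide
    beta_1 = (n * sxy - sx * sy) / det
    beta_0 = (sxx * sy - sx * sxy) / det
    return beta_1, beta_0

xs = [0.0, 1.0, 2.0, 3.0]

# Hypothetical real-valued responses lying exactly on y = 2x + 1.
slope, intercept = fit_least_squares(xs, [1.0, 3.0, 5.0, 7.0])

# Hypothetical 0/1 class labels: the fitted values are not confined to [0, 1].
b1, b0 = fit_least_squares(xs, [0.0, 0.0, 1.0, 1.0])
fitted = [b1 * x + b0 for x in xs]
```

With the 0/1 labels, the fitted value at the smallest input is negative and the one at the largest input exceeds 1, which is the behaviour the note warns about.<br />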
<br />
===Logistic Regression===<br />
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood <math>\displaystyle Pr(Y|X)</math>. Since <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications with two classes, for instance: people survive or die, have a disease or not, have a risk factor or not.<br />
<br />
==== logistic function ====<br />
[[File:200px-Logistic-curve.svg.png | Logistic Sigmoid Function]]<br />
<br />
<br />
<br />
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common sigmoid curve. <br />
<br />
1. <math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
3. <math>y(0) = \frac{1}{2}</math><br />
<br />
4. <math> \int y dx = ln(1 + e^{x})</math><br />
<br />
5. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
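<br />
The properties listed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the evaluation point <code>x = 0.3</code> and the step size <code>h</code> are arbitrary choices made for illustration.<br />

```python
import math

def sigmoid(x):
    """Logistic function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_taylor(x):
    """Degree-5 Taylor polynomial of the logistic function about x = 0.

    x^5 / 480 is the fifth-order Taylor coefficient of the logistic function.
    """
    return 0.5 + x / 4.0 - x**3 / 48.0 + x**5 / 480.0

x, h = 0.3, 1e-6
y = sigmoid(x)

# Property 2: dy/dx = y (1 - y), compared with a central finite difference.
numeric_slope = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
closed_form_slope = y * (1.0 - y)

# Property 5: near 0 the Taylor polynomial tracks the function closely.
taylor_error = abs(sigmoid_taylor(x) - y)
```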
<br />
====Intuition behind Logistic Regression====<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
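<br />
The step from the log-odds model to the logistic form can be written out explicitly. As a short worked derivation, write <math>\,p=P(Y=1|X=x)</math>, so that <math>\,P(Y=0|X=x)=1-p</math> in the two-class case (the symbol <math>\,p</math> is shorthand introduced only for this derivation):<br />
<br />
:<math>\log\left(\frac{p}{1-p}\right)=\beta^Tx \;\Rightarrow\; \frac{p}{1-p}=e^{\beta^Tx} \;\Rightarrow\; p=(1-p)\,e^{\beta^Tx} \;\Rightarrow\; p=\frac{e^{\beta^Tx}}{1+e^{\beta^Tx}}</math><br />
<br />
Since <math>\,e^{\beta^Tx}>0</math>, the resulting <math>\,p</math> always lies strictly between 0 and 1.<br />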
<br />
====The Logistic Regression Model====<br />
<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
{{Cleanup|date=October 18 2010|reason=I Could not find any source for these graphs. However, they following the definition of the defined probability. I don't think the generated graph as it is here is copyrighted, but if you worried you can draw this figure by applying the function and post the result.}}<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
====Fitting a Logistic Regression====<br />
Logistic regression fits a distribution. The common practice in statistics is to fit a density function, here the posterior of each class <math>\displaystyle Pr(Y|X)</math>, to the data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the observed data <math>\displaystyle{x_{1},...,x_{n}}</math> with their labels under the assumed model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence and identical distribution)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the logarithm of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
{{Cleanup|date=October 13 2010|reason=I think, in the following, y_i * x_i and the single x_i on the right side should both be transposed by matrix calculus?}}<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math> <br />
<br />
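Setting this derivative to zero gives <math>\,d+1</math> nonlinear equations in <math> \beta </math>. Since the component of each <math>\underline{x}_i</math> that multiplies the intercept is 1, the corresponding equation reads <math>\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta}) </math>, i.e. the expected number of class ones matches the observed number.<br />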
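<br />
The log-likelihood and its derivative can be cross-checked numerically. The sketch below compares the analytic gradient with central finite differences of <math>\,l(\underline\beta)</math>; the function names, the toy data, and the trial value of <code>beta</code> are hypothetical and chosen only for illustration.<br />

```python
import math

def log_likelihood(beta, xs, ys):
    """l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        a = sum(b * v for b, v in zip(beta, x))
        total += y * a - math.log(1.0 + math.exp(a))
    return total

def gradient(beta, xs, ys):
    """dl/dbeta = sum_i (y_i - p(x_i; beta)) * x_i."""
    grad = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        a = sum(b * v for b, v in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-a))
        for j, v in enumerate(x):
            grad[j] += (y - p) * v
    return grad

# Hypothetical toy data: each x_i is (feature, 1); the trailing 1 is the intercept term.
xs = [(-2.0, 1.0), (-1.0, 1.0), (0.5, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [0, 0, 1, 0, 1]
beta = [0.4, -0.1]

# Central finite differences of l, one coordinate of beta at a time.
h = 1e-6
numeric = []
for j in range(len(beta)):
    up = list(beta)
    up[j] += h
    down = list(beta)
    down[j] -= h
    numeric.append((log_likelihood(up, xs, ys) - log_likelihood(down, xs, ys)) / (2 * h))
analytic = gradient(beta, xs, ys)
```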
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
====Extension====<br />
<br />
* When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model].<br />
<br />
* Limitations of Logistic Regression:<br />
:1. No assumptions are made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another, because this could cause problems with estimation.<br />
:2. A large number of data points (i.e. a large sample size) is required for logistic regression to have sufficient numbers in both classes. The more features/dimensions the data has, the larger the sample size required.<br />
<br />
==Lecture summary==<br />
{{Cleanup|date=October 18 2010|reason=Can anybody provide a better lecture summary? The one below is to just get it started}}<br />
In this lecture, linear regression was introduced and the density function for the two-class problem was defined. Maximum likelihood was used to estimate the parameters of the distribution (i.e. to fit the density function to the two classes of the logistic model).<br />
<br />
== Logistic Regression Cont. - October 14, 2010 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Estimating Parameters <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
'''Newton-Raphson Algorithm:'''<br /><br />
<br />
Suppose we want to find <math>\ x^* </math> such that <math>\ f(x^*)=0</math>.<br />
<br />
We first pick a starting point <math>\ x^{old}</math> and then iterate:<br />
<br /><br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f(x^{old})}{f'(x^{old})} </math> <br /><br />
<math> \ x^{old} \leftarrow x^{new}</math> <br />
<br /><br />
This is repeated until convergence, i.e. until <math>\ x^{new}</math> and <math>\ x^{old}</math> are sufficiently close.<br />
<br />
If we want to maximize or minimize <math>\ f(x) </math>, we instead solve <math>\ f'(x)=0 </math>, which gives the update<br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f'(x^{old})}{f''(x^{old})} </math><br />
<br />
<br /><br />
<br />
In vector notation the above can be written as <br /><br />
<br />
<math><br />
X^{new} \leftarrow X^{old} - H^{-1}\Delta<br />
</math><br />
<br /><br />
Here <math>\ H</math> is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix] (the matrix of second derivatives) and <math>\Delta</math> is the gradient, both evaluated at <math>X^{old}</math>.<br />
<br /><br />
<br />
'''Note:''' If the Hessian is not invertible, the [http://en.wikipedia.org/wiki/Generalized_inverse generalized inverse] (pseudo-inverse) can be used.<br />
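<br />
The scalar form of the iteration can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name <code>newton_raphson</code>, the tolerance, and the two test functions are hypothetical, and no safeguard is included for a zero derivative.<br />

```python
def newton_raphson(f, f_prime, x0, tol=1e-12, max_iter=50):
    """Iterate x_new = x_old - f(x_old) / f'(x_old) until the step is tiny."""
    x_old = x0
    for _ in range(max_iter):
        x_new = x_old - f(x_old) / f_prime(x_old)
        if abs(x_new - x_old) < tol:
            return x_new
        x_old = x_new
    return x_old

# Root finding: solve f(x) = x^2 - 2 = 0, starting from x = 1.
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)

# Optimization: minimize g(x) = (x - 3)^2 + 1 by finding the root of
# g'(x) = 2 (x - 3), whose derivative is g''(x) = 2.
minimiser = newton_raphson(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
```

For the quadratic <code>g</code>, a single Newton step already lands on the minimiser, because the second-order model of a quadratic is exact.<br />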
<br /><br />
<br /><br />
<br />
<br />
As shown above, the [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second derivative, or Hessian, of the log-likelihood.<br />
<br />
<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''Note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>. You can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website that includes a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative is obtained if we first reduce the number of occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math><br />
<br />
and then differentiate <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{(d+1)}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i,i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^{T}\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
<math>\ Z </math> is called the adjusted response. Because <math>\ \underline{P} </math>, <math>\ W </math>, and <math>\ Z </math> all change with <math>\underline{\beta}</math>, the problem must be solved again at each iteration. This algorithm is called [http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares iteratively reweighted least squares] because it solves a weighted least squares problem repeatedly.<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-\underline{\beta}^T X)(\underline{y}-\underline{\beta}^TX)^T</math><br />
<br />
whose solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}^{T}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
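<br />
The pseudo code above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the helper <code>solve</code>, the function name <code>fit_logistic</code>, and the toy data are assumptions introduced here, and each feature vector is assumed to already contain the constant 1. Steps 5 and 6 are combined into the algebraically equivalent Newton form <math>\underline{\beta}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math>, which avoids forming <math>W^{-1}</math>.<br />

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def posterior(beta, x):
    """P(x; beta) = exp(beta^T x) / (1 + exp(beta^T x))."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))

def fit_logistic(xs, ys, tol=1e-10, max_iter=25):
    """Newton-Raphson / IRLS for two-class logistic regression with 0/1 labels."""
    m = len(xs[0])
    beta = [0.0] * m                                  # step 1
    for _ in range(max_iter):
        ps = [posterior(beta, x) for x in xs]         # step 3
        ws = [p * (1.0 - p) for p in ps]              # step 4: diagonal of W
        # A = X W X^T and g = X (Y - P)
        A = [[sum(w * x[j] * x[k] for w, x in zip(ws, xs)) for k in range(m)]
             for j in range(m)]
        g = [sum((y - p) * x[j] for y, p, x in zip(ys, ps, xs)) for j in range(m)]
        step = solve(A, g)                            # steps 5 and 6 combined
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:           # step 7
            break
    return beta

# Hypothetical, non-separable toy data: each x_i is (feature, 1).
xs = [(-2.0, 1.0), (-1.0, 1.0), (0.5, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [0, 0, 1, 0, 1]
beta_hat = fit_logistic(xs, ys)

# At the maximum, the first derivative X (Y - P) should vanish.
score = [sum((y - posterior(beta_hat, x)) * x[j] for x, y in zip(xs, ys))
         for j in range(2)]
```

The toy data are deliberately not linearly separable; for separable data the maximum likelihood estimate does not exist as a finite vector and the iteration would not settle.<br />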
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''Note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow \exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1. There exists a closed-form solution for least squares.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1. No closed-form solution exists.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d+1</math> (the <math>\,d</math> coefficients plus the intercept). The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#LDA estimates its parameters more efficiently by using more information about the data; samples without class labels can also be used in LDA. <br />
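<br />
As a quick arithmetic check of the LDA count above, take <math>\,d=10</math>. Reading the three terms as the two class means (<math>\,2d</math>), the shared covariance matrix (<math>\,d(d+1)/2</math>) and the two priors (an interpretation assumed here, not stated in the list), the count is<br />
<br />
:<math>2(10)+\frac{10\cdot 11}{2}+2=20+55+2=77=\frac{10^2+5\cdot 10+4}{2}</math><br />
<br />
For <math>\,d=100</math> the same formula gives <math>\,(10000+500+4)/2=5252</math>, which illustrates the quadratic growth, whereas the logistic regression coefficient vector grows only linearly with <math>\,d</math>.<br />
<br />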
<br />
{{Cleanup|date=October 2010|reason= Could somebody please validate the following points}} <br />
{{Cleanup|date=October 2010|reason= I'm not too sure about the first point either, but it seems reasonable to me. Would be great if someone can confirm this point. Thanks}} <br />
<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust. (For high dimensionality logistic regression is more accommodating)<br />
#In practice, logistic regression and LDA often give similar results.<br />
#Logistic regression is more robust, because it does not assume normal distribution regarding each independent variable.<br />
<br />
Many other advantages of logistic regression are explained [http://www.statgun.com/tutorials/logistic-regression.html here].<br />
<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data with logistic regression. This function returns B, a <math>\,(d+1)</math><math>\,\times</math><math>\,(k-1)</math> matrix of estimates, where each column contains the estimated intercept term followed by the predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression. The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
===Lecture Summary===<br />
<br />
Traditionally, logistic regression parameters are estimated using maximum likelihood. However, other optimization techniques may be used as well.<br />
<br /><br />
Since there is no closed-form solution for the zero of the first derivative of the log-likelihood, the Newton-Raphson algorithm is used. Since the log-likelihood is concave, any stationary point that Newton's method converges to is a global optimum.<br />
<br /><br />
Logistic regression requires fewer parameters than LDA or QDA and is therefore more favorable for high-dimensional data.<br />
<br />
===Supplements===<br />
<br />
A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture.<br />
<br />
== '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' ==<br />
<br />
=== Lecture Summary ===<br />
<br />
In this lecture, the topic of logistic regression was concluded by covering multi-class logistic regression, and a new topic, the perceptron, was introduced. The perceptron is a linear classifier for two-class problems. Its main goal is to classify data into 2 classes by minimizing the distances between the misclassified points and the decision boundary. This will be continued in the following lectures.<br />
<br />
=== Multi-Class Logistic Regression ===<br />
Recall that in two-class logistic regression, the posterior probability of one of the classes (say class 1) is modeled by a function of the form shown in figure 1. <br />
<br />
The posterior probability of the second class (say class 0) is the complement of the first class (class 1). <br /><br /><br />
<math>\displaystyle P(Y=0 | X=x) = 1 - P(Y=1 | X=x)</math><br /><br />
<br />
This function is called the sigmoid (logistic) function, which is the reason why this algorithm is called "logistic regression".<br />
[[File:Picture1.png|150px|thumb|right|<math>Fig.1: P(Y=1 | X=x)</math>]]<br />
<br />
<math>\displaystyle \sigma\,\!(a) = \frac {e^a}{1+e^a} = \frac {1}{1+e^{-a}}</math><br /><br /><br />
<br />
In two-class logistic regression, we compare the posterior of one class to the other one using this ratio:<br /><br />
<br />
:<math> \frac{P(Y=1|X=x)}{P(Y=0|X=x)}</math><br /><br />
<br />
If we look at the natural logarithm of this ratio, we find that it is always a linear function in <math>x</math>:<br /><br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\underline{\beta}^T\underline{x} \quad \rightarrow (*)</math> <br /><br /><br />
<br />
What if we have more than two classes?<br /><br />
<br />
Using (*), we can extend the notion of logistic regression for the cases where we have more than two classes.<br /><br />
<br />
Assume we have <math>k</math> classes. Looking at the logarithm of the ratio of posteriors of each class and the k<sup>th</sup> class, we have: <br /><br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_1}^T\underline{x} </math> <br /><br />
:<math>\log\left(\frac{P(Y=2|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_2}^T\underline{x} </math> <br /><br />
<math> \vdots</math><br />
:<math>\log\left(\frac{P(Y=k-1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_{k-1}}^T\underline{x} </math> <br /><br />
<br />
Although in the above posterior ratios, the denominator is chosen to be the posterior of the last class (class k), the choice of denominator is arbitrary in that the posterior estimates are equivariant under this choice - [http://www.springerlink.com/content/t45k620382733r71/ Linear Methods for Classification].<br /><br /><br />
<br />
Each of these functions is linear in <math>x</math>; however, each has a different <math>\beta</math>. We also have to make sure that the posterior probabilities assigned to the different classes sum to one.<br /><br /><br />
<br />
In general, we can write:<br />
<br /><math>P(Y=c | X=x) = \frac{e^{\underline{\beta_c}^T \underline{x}}}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}},\quad c \in \{1,\dots,k-1\} </math><br /><br />
<br /><math>P(Y=k | X=x) = \frac{1}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}}</math><br /><br />
These posteriors clearly sum to one. <br /><br /><br />
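<br />
The two formulas above can be evaluated directly. The sketch below is a minimal illustration in which the function name <code>multiclass_posteriors</code>, the coefficient vectors, and the input <code>x</code> are hypothetical; class <math>k</math> acts as the reference class.<br />

```python
import math

def multiclass_posteriors(betas, x):
    """Posteriors for classes 1..k from the k-1 vectors beta_1..beta_{k-1}.

    Class k is the reference class, whose linear score is fixed at 0.
    """
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]   # classes 1..k-1
    probs.append(1.0 / denom)                       # class k
    return probs

# Hypothetical coefficients for k = 3 classes and an input with 3 components.
betas = [[1.0, -0.5, 0.2], [-0.3, 0.8, 0.0]]
x = [0.5, 2.0, 1.0]
probs = multiclass_posteriors(betas, x)

# The log-ratio against class k recovers the linear function beta_1^T x.
log_ratio_1k = math.log(probs[0] / probs[2])
score_1 = sum(b * v for b, v in zip(betas[0], x))
```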
<br />
In the two-class case, it is fairly simple to find the <math>\beta</math> parameter (the <math>\beta</math> in two-class logistic regression has dimension <math>(d+1)\times1</math>), as mentioned in previous lectures. In the multi-class case the iterative Newton method can still be used, but here <math>\beta</math> is of size <math>(d+1)\times(k-1)</math> and the weight matrix W is a dense, non-diagonal matrix. The resulting algorithm is computationally inefficient, although still feasible. A trick is to re-parametrize the logistic regression problem by expanding the input vector <math>x</math> (Question 4 in assignment no. 2).<br />
<br /><br /><br />
<br />
===Neural Network Concept===<br />
The concept of an artificial neural network comes from scientists who wanted to simulate the human neural network on computers; they were trying to create computer programs that can learn like people. A neural network is a method in artificial intelligence that is a simplified model of neural processing in the brain, even though the relation between this model and the biological architecture of the brain is not yet clear.<br />
<br />
=== Perceptron ===<br />
<br />
The [http://en.wikipedia.org/wiki/Perceptron Perceptron] was invented in 1957 by [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt]. It is the basic building block of feedforward neural networks.<br /><br /><br />
<br />
We know that the least squares fit obtained by regressing a -1/1 response variable <math>\displaystyle Y</math> on the observations <math>\displaystyle x</math> leads to the same coefficients as LDA. Recall that LDA minimizes the distance between the discriminant function (decision boundary) and the data points. The least squares classifier returns the sign of the linear combination of the features as the class label (figure 2). This was called the perceptron in the engineering literature during the 1950s. <br /><br /><br />
<br />
[[File:Perceptron.jpg|371px|thumb|right| Fig.2 Diagram of a linear perceptron ]]<br />
<br />
There is a cost function <math>\displaystyle D</math> that perceptron tries to minimize:<br /><br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math><br /><br />
<br />
where <math>\displaystyle M</math> is a set of misclassified points. <br /><br />
<br />
This is basically minimizing the sum of distances between the misclassified points and the decision boundary.<br /><br /><br />
<br />
'''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br /><br />
<br />
Consider points <math>\underline{x_1}</math>, <math>\underline{x_2}</math> and a decision boundary defined by <math>\underline{\beta}^T\underline{x}+\beta_0=0</math>, as shown in figure 3.<br /><br />
<br />
[[File:DB.jpg|248px|thumb|right| Fig.3 Distance from the decision boundary ]]<br />
<br />
Since both <math>\underline{x_1}</math> and <math>\underline{x_2}</math> lie on the decision boundary, we have:<br /><br />
<math>\underline{\beta}^T\underline{x_1}+\beta_0=0 \rightarrow (1)</math><br /><br />
<math>\underline{\beta}^T\underline{x_2}+\beta_0=0 \rightarrow (2)</math><br /><br />
<br />
From (1) and (2):<br /><br />
<math>\underline{\beta}^T(\underline{x_2}-\underline{x_1})=0</math><br /><br />
<br />
Therefore, <math>\displaystyle \underline{\beta}</math> is orthogonal to <math>\underline{x_2}-\underline{x_1}</math>, which lies along the decision boundary; that is, <math>\displaystyle \underline{\beta}</math> is orthogonal to the decision boundary. <br />
<br />
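Then the signed distance of a point <math>\underline{x_0}</math> from the decision boundary is the projection of <math>\underline{x_0}-\underline{x_2}</math> onto the direction of <math>\underline{\beta}</math>. Up to the constant factor <math>1/\|\underline{\beta}\|</math> (which equals 1 when <math>\underline{\beta}</math> has unit length), this is: <br />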
<br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})</math><br /><br />
<br />
From (2): <br /><br />
<br />
<math>\underline{\beta}^T\underline{x_2}= -\beta_0</math>. <br /><br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})=\underline{\beta}^T\underline{x_0}-\underline{\beta}^T\underline{x_2}=\underline{\beta}^T\underline{x_0}+\beta_0</math><br /><br />
<br />
Therefore, up to the constant factor <math>1/\|\underline{\beta}\|</math>, the signed distance from any point <math>\underline{x_{i}}</math> to the discriminant hyperplane is given by <math>\underline{\beta}^T\underline{x_{i}}+\beta_0</math>.<br /><br />
<br />
However, this quantity is not always positive. Consider <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>: if <math>\underline{x_{i}}</math> is classified ''correctly'', then this product is positive, since <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are either both positive or both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> is negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> therefore makes this cost function nonnegative. <br /><br />
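<br />
As a minimal sketch of the cost function above, the following Python code sums <math>-y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> over the misclassified points only. The function names <code>decision_value</code> and <code>perceptron_cost</code> and the toy data are hypothetical and not from the lecture; treating points that lie exactly on the boundary as misclassified is also an assumption.<br />

```python
def decision_value(beta, beta0, x):
    """Return beta^T x + beta0 (proportional to the signed distance of x from the boundary)."""
    return sum(b * xj for b, xj in zip(beta, x)) + beta0


def perceptron_cost(beta, beta0, X, y):
    """Return D(beta, beta0) = -sum over misclassified i of y_i * (beta^T x_i + beta0)."""
    cost = 0.0
    for xi, yi in zip(X, y):
        margin = yi * decision_value(beta, beta0, xi)
        if margin <= 0:  # assumption: points on the boundary count as misclassified
            cost -= margin
    return cost


# Toy data made up for illustration; labels are -1 or +1.
X = [(2.0, 1.0), (-1.0, -1.0), (1.0, -3.0)]
y = [1, -1, 1]
print(perceptron_cost([1.0, 1.0], 0.0, X, y))
```

Because only misclassified points contribute, the returned value is never negative, and it is zero exactly when no point is misclassified.<br />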
<br />
==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 ==<br />
===Lecture Summary===<br />
In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on gradient descent. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network (FFNN). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation. <br />
<br />
To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.<br />
<br />
===Perceptron===<br />
The last lecture introduced the Perceptron and showed how it can suggest a solution for the 2-class classification problem. We saw that the solution requires minimization of a cost function, which is basically a summation of the distances of the misclassified data points to the separating hyperplane. This cost function is<br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x}_i+\beta_0),</math><br />
<br />
in which <math>\,M</math> is the set of misclassified points. Thus, the objective is to find <math>\arg\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>.<br />
<br />
====Perceptron Learning Algorithm====<br />
To minimize <math>D(\underline{\beta},\beta_0)</math>, an algorithm based on gradient descent has been suggested. Gradient descent, also known as steepest descent, is a numerical optimization technique that starts from an initial value for <math>(\underline{\beta},\beta_0)</math> and iteratively approaches an optimal solution. Each iteration updates <math>(\underline{\beta},\beta_0)</math> by subtracting from it a factor of the gradient of <math>D(\underline{\beta},\beta_0)</math>. Mathematically, this gradient is<br />
<br />
<math>\nabla D(\underline{\beta},\beta_0)<br />
= \left( \begin{array}{c}\cfrac{\partial D}{\partial \underline{\beta}} \\ \\ <br />
\cfrac{\partial D}{\partial \beta_0} \end{array} \right)<br />
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}\underline{x}_i \\ <br />
-\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math><br />
<br />
However, the perceptron learning algorithm does not use the sum of the contributions from each observation to calculate the gradient for each step. Instead, each step uses the gradient contribution from only a single observation, and each successive step uses a different observation. This slight modification is called stochastic gradient descent. That is, instead of subtracting some factor of <math>\nabla D(\underline{\beta},\beta_0)</math> at each step, we subtract a factor of the contribution of a single misclassified observation <math>\,i</math>; since that contribution carries a minus sign, this amounts to adding a factor of<br />
<br />
<math>\left( \begin{array}{c} y_{i}\underline{x}_i \\ <br />
y_{i} \end{array} \right)</math><br />
<br />
As a result, the pseudo code for the Perceptron Learning Algorithm is as follows:<br />
<br />
:1) Choose a random initial value for <math>(\underline{\beta},\beta_0)</math>.<br />
<br />
:2) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^0\\<br />
\beta_0^0<br />
\end{pmatrix}</math><br />
<br />
:3) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{new}}\\<br />
\beta_0^{\mathrm{new}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix}<br />
y_i \underline{x_i}\\<br />
y_i<br />
\end{pmatrix}</math> for some <math>\,i \in M</math>.<br />
<br />
:4) If the termination criterion has not been met, go back to step 3 and use a different observation datapoint (i.e. a different <math>\,i</math>).<br />
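<br />
The steps above can be sketched in Python as follows. This is an illustrative sketch rather than the lecture's own code: the function name <code>train_perceptron</code>, the uniform initialization on [-1, 1], the random choice of the misclassified point, and the iteration cap <code>max_iter</code> are all assumptions.<br />

```python
import random


def train_perceptron(X, y, rho=1.0, max_iter=1000, seed=0):
    """Stochastic-gradient perceptron; returns (beta, beta0, converged)."""
    rng = random.Random(seed)
    d = len(X[0])
    # Steps 1-2: random initial value for (beta, beta0).
    beta = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    beta0 = rng.uniform(-1.0, 1.0)
    for _ in range(max_iter):
        # M: indices of the currently misclassified points.
        M = [i for i, (xi, yi) in enumerate(zip(X, y))
             if yi * (sum(b * v for b, v in zip(beta, xi)) + beta0) <= 0]
        if not M:  # termination: no misclassified points remain
            return beta, beta0, True
        i = rng.choice(M)  # Step 3: update with a single misclassified point
        beta = [b + rho * y[i] * v for b, v in zip(beta, X[i])]
        beta0 += rho * y[i]
    return beta, beta0, False  # iteration cap reached without convergence
```

The third return value distinguishes the case where <math>\,M</math> became empty from the case where the iteration cap was hit, which matters when the classes are not linearly separable.<br />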
<br />
The learning rate <math>\,\rho</math> controls the step size of convergence toward <math>\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>. A larger value for <math>\,\rho</math> causes the steps to be larger. If <math>\,\rho</math> is set too large, however, the minimum could be missed (over-stepped).<br />
In practice, <math>\rho</math> can be adaptive rather than fixed; that is, <math>\rho</math> can be larger in the first steps than in the last steps. At the beginning, a larger <math>\rho</math> helps to find an approximate answer sooner, and a smaller <math>\rho</math> in the last steps helps to tune the final answer more accurately. <br />
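<br />
One simple way to make <math>\rho</math> adaptive is a decaying schedule. The particular form <math>\rho_t = \rho_0/(1+ct)</math> and the names <code>learning_rate</code>, <code>rho0</code> and <code>decay</code> in this Python sketch are assumptions chosen for illustration; the lecture does not prescribe a specific schedule.<br />

```python
def learning_rate(step, rho0=1.0, decay=0.01):
    """Return rho_t = rho0 / (1 + decay * step): larger early steps, smaller later ones."""
    return rho0 / (1.0 + decay * step)


# Print the learning rate at a few iteration counts.
for t in (0, 10, 100, 1000):
    print(t, learning_rate(t))
```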
<br />
<br />
As mentioned earlier, the learning algorithm uses just one of the data points at each iteration; this is the common practice when dealing with online applications. In an online application, datapoints are accessed one-at-a-time because training data is not available in batch form. The learning algorithm does not require the derivative of the cost function with respect to the previously seen points; instead, we just have to take into consideration the effect of each new point.<br />
<br />
One way that the algorithm could terminate is if there are no more misclassified points (i.e. if the set <math>\,M</math> is empty). As long as there are points in <math>\,M</math>, the algorithm continues until some other termination criterion is reached. The termination criterion for an optimization algorithm is usually convergence, but for numerical methods this is not well-defined. In theory, convergence is realized when the gradient of the cost function is zero; in numerical methods, a gradient that is close to zero within some margin of error is accepted instead.<br />
<br />
When the data are linearly separable, the algorithm is theoretically guaranteed to converge in a finite number of iterations. This number of iterations depends on the <br />
<br />
* learning rate <math>\,\rho</math><br />
<br />
* initial value <math>(\underline{\beta},\beta_0)</math><br />
<br />
* difficulty of the problem. The problem is more difficult if the gap between the classes of data is very small.<br />
<br />
Note that we consider the offset term <math>\beta_0</math> separately from the <math>\underline{\beta}</math> to distinguish this formulation from those in which the direction of the hyperplane (<math>\underline{\beta}</math>) has been considered.<br />
<br />
A major concern about gradient descent is that it may get trapped in local optimal solutions.<br />
<br />
====Some notes on the Perceptron Learning Algorithm====<br />
<br />
* If the training data points are available in batch form, it is better to take advantage of a closed-form optimization technique such as least squares or maximum-likelihood estimation for linear classifiers. (These closed-form solutions had been around for many years before the invention of the Perceptron.)<br />
<br />
* Just like the linear classifier, a Perceptron can discriminate between only two classes at a time, and it can be extended to multi-class problems by using one of the <math>k-1</math>, <math>k</math>, or <math>k(k-1)/2</math>-hyperplane methods.<br />
<br />
* If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane, which makes the error of training data zero. The convergence is guaranteed if the learning rate is set adequately.<br />
<br />
* If the two classes are not linearly separable, the algorithm will never converge. In these cases, a separate termination criterion is needed (e.g. a maximum number of iterations within which convergence is expected, or the rate of change of both the cost function and its derivative).<br />
<br />
* In the case of linearly separable classes, the final solution and number of iterations will be dependent on the initial conditions, learning rate, and the gap between the two classes. In general, a smaller gap between classes requires a greater number of iterations for the algorithm to converge.<br />
<br />
* The learning rate (or updating step) has a direct impact on both the number of iterations and the accuracy of the solution of the optimization problem. Smaller values make convergence slower but lead to a more accurate solution. Conversely, larger values make the process faster at the cost of some precision. One must therefore balance this trade-off in order to reach an accurate enough solution quickly enough (exploration vs. exploitation).<br />
<br />
In the upcoming lectures, we introduce Support Vector Machines (SVM), which use an iterative optimization scheme similar to that of the Perceptron but with a different definition of the cost function.<br />
<br />
===Universal Function Approximator===<br />
The universal function approximator is a mathematical formulation for a group of estimation techniques. The usual formulation for it is<br />
<br />
<math>\hat{Y}(x)=\sum\limits_{i=1}^{n}\alpha_i\sigma(\omega_i^Tx+b_i),</math><br />
<br />
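where <math>\hat{Y}(x)</math> is an estimate of a function <math>\,Y(x)</math>. According to the universal approximation theorem, under suitable conditions on <math>\,Y</math> and <math>\,\sigma</math>, for any <math>\,\epsilon > 0</math> the number of terms <math>\,n</math> and the parameters can be chosen so that<br />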
<br />
<math>|\hat{Y}(x) - Y(x)|<\epsilon,</math><br />
<br />
which means that <math>\hat{Y}(x)</math> can get as close to <math>\,Y(x)</math> as necessary.<br />
<br />
This formulation assumes that the output, <math>\,Y(x)</math>, is a linear combination of a set of functions <math>\,\sigma(.)</math>, where <math>\,\sigma(.)</math> is a nonlinear function of the inputs <math>\,x_i</math>.<br />
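<br />
The formula for <math>\hat{Y}(x)</math> can be evaluated directly, as in the following Python sketch. Using the logistic function for <math>\,\sigma(.)</math> is an assumption at this point in the notes (the text only requires <math>\,\sigma</math> to be nonlinear), and the names <code>sigma</code> and <code>approximator</code> are hypothetical.<br />

```python
import math


def sigma(a):
    """Logistic function, used here as the nonlinear function sigma(.)."""
    return 1.0 / (1.0 + math.exp(-a))


def approximator(x, alphas, omegas, biases):
    """Evaluate Y_hat(x) = sum_i alpha_i * sigma(omega_i^T x + b_i)."""
    total = 0.0
    for alpha, omega, b in zip(alphas, omegas, biases):
        a = sum(w * xj for w, xj in zip(omega, x)) + b
        total += alpha * sigma(a)
    return total


# Two steep sigmoids with opposite signs form a "bump" that is close to 1 near x = 0.
alphas, omegas, biases = (1.0, -1.0), ((20.0,), (20.0,)), (20.0, -20.0)
print(approximator((0.0,), alphas, omegas, biases))
print(approximator((3.0,), alphas, omegas, biases))
```

Sums of such bumps with different centres and heights illustrate how a linear combination of simple nonlinear units can follow the shape of a more complicated function.<br />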
<br />
====Generalization Factors====<br />
Even though this formulation represents a universal function approximator, which means that it can be fitted to a set of data as closely as demanded, the closeness of fit must be carefully decided upon. In many cases, the purpose of the model is to target unseen data. However, the fit to this unseen data is impossible to determine before it arrives.<br />
<br />
To overcome this dilemma, a common practice is to divide the available data points into two sets: training data and validation data. We use the training data to estimate the fixed parameters of the model, and then use the validation data to find values for the construction-dependent parameters. What these construction-dependent parameters are depends on the model. In the case of a polynomial, the construction-dependent parameter would be its degree, and for a neural network, the construction-dependent parameters could be the number of hidden layers and the number of neurons in each layer.<br />
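<br />
As a hedged illustration of this training/validation scheme, the following Python sketch selects the degree of a polynomial (the construction-dependent parameter mentioned above). The function names, the use of the normal equations for the least-squares fit, and the toy data are all assumptions made for the example.<br />

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations; returns c[0..degree]."""
    n = degree + 1
    # Normal equations A c = b with A[j][k] = sum x^(j+k) and b[j] = sum y * x^j.
    A = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= factor * A[col][k]
            b[r] -= factor * b[col]
    # Back-substitution.
    coeffs = [0.0] * n
    for j in reversed(range(n)):
        s = sum(A[j][k] * coeffs[k] for k in range(j + 1, n))
        coeffs[j] = (b[j] - s) / A[j][j]
    return coeffs


def mean_squared_error(coeffs, xs, ys):
    """Average squared difference between the polynomial's predictions and ys."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


def select_degree(train, validation, degrees):
    """Fit each degree on the training set; keep the one with the lowest validation error."""
    errors = {}
    for deg in degrees:
        coeffs = fit_polynomial(train[0], train[1], deg)
        errors[deg] = mean_squared_error(coeffs, validation[0], validation[1])
    best = min(errors, key=errors.get)
    return best, errors


# Toy data generated from y = x^2, split into training and validation sets.
train = ([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])
validation = ([-1.5, -0.5, 0.5, 1.5], [2.25, 0.25, 0.25, 2.25])
print(select_degree(train, validation, [0, 1, 2]))
```

The coefficients (the fixed parameters) are estimated from the training set only, while the degree is chosen by comparing errors on the validation set.<br />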
<br />
These matters of model generalization vs. complexity will be discussed in more detail in the lectures to follow.<br />
<br />
===Feed-Forward Neural Network===<br />
The Neural Network (NN) is one application of the universal function approximator. It can be thought of as a system of Perceptrons linked together as units of a network. One particular NN useful for classification is the Feed-Forward Neural Network (FFNN), which consists of multiple "hidden layers" of Perceptron units. Our discussion here is based around the FFNN, which has the topology shown in Figure 1. The units in the first hidden layer receive input from the original features. Between the hidden layers, connections from each unit are always directed to units in the next adjacent layer. In the output layer, which receives input only from the last hidden layer, each unit produces a target measurement for a distinct class (i.e. <math>\,K</math> classes require <math>\,K</math> units). In Figure 1, the units in a single layer are distributed vertically, and the inputs and outputs of the network are shown as the far left and right layers respectively.<br />
<br />
[[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]]<br />
<br />
====Mathematical Model of the FFNN with One Hidden Layer====<br />
The FFNN with one hidden layer for a <math>\,K</math>-class problem is defined as follows. Let <math>\,d</math> be the number of input features, <math>\,p</math> be the number of units in the hidden layer, and <math>\,K</math> be the number of classes (i.e. the number of units in the output layer).<br />
<br />
Each neural unit calculates its derived feature (i.e. output) using a linear combination of its inputs. Suppose <math>\,\underline{x}</math> is the <math>\,d</math>-dimensional vector of input features. Then, each neural unit uses a <math>\,d</math>-dimensional vector of weights to combine these input features: for the <math>\,i</math>th neural unit, let <math>\underline{u}_i</math> be this vector of weights. The linear combination calculated by the <math>\,i</math>th unit is then given by<br />
<br />
<math>a_i = \underline{u}_i^T\underline{x}</math><br />
<br />
However, we want the derived feature to lie between 0 and 1, so we apply an ''activation function'' <math>\,\sigma(a)</math>. The derived feature for the <math>\,i</math>th unit is then given by<br />
<br />
<math>\,z_i = \sigma(a_i)</math> where <math>\,\sigma</math> is typically the logistic function<br />
<br />
<math>\sigma(a) = \cfrac{1}{1+e^{-a}}</math><br />
<br />
Now, we place each of the derived features <math>\,z_i</math> from the hidden layer into a <math>\,p</math>-dimensional vector:<br />
<br />
<math>\underline{z} = \left[ \begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_p \end{array}\right]</math><br />
<br />
Like in the hidden layer, each unit in the output layer calculates its derived feature using a linear combination of its inputs. Each neural unit uses a <math>\,p</math>-dimensional vector of weights to combine the input features derived from the hidden layer. Let <math>\,\underline{w}_k</math> be this vector of weights used in the <math>\,k</math>th unit. The linear combination calculated by the <math>\,k</math>th unit is then given by<br />
<br />
<math>\hat{y}_k = \underline{w}_k^T\underline{z}</math><br />
<br />
<math>\,\hat{y}_k</math> is thus the estimated target measurement for the <math>\,k</math>th class. Note that an activation function <math>\,\sigma</math> is not used here.<br />
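<br />
The model just described can be sketched as a forward pass in Python. The function names <code>logistic</code> and <code>forward</code> and the list-of-lists layout of the weights are assumptions made for illustration.<br />

```python
import math


def logistic(a):
    """Logistic activation sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, U, W):
    """Forward pass of a one-hidden-layer FFNN.

    U: list of p weight vectors u_i (each of length d) for the hidden units.
    W: list of K weight vectors w_k (each of length p) for the output units.
    Returns (a, z, y_hat).
    """
    a = [sum(u * xj for u, xj in zip(u_i, x)) for u_i in U]      # a_i = u_i^T x
    z = [logistic(a_i) for a_i in a]                             # z_i = sigma(a_i)
    y_hat = [sum(w * zj for w, zj in zip(w_k, z)) for w_k in W]  # y_hat_k = w_k^T z
    return a, z, y_hat


# Example with d = 2 inputs, p = 2 hidden units and K = 2 outputs (weights made up).
print(forward([1.0, -1.0], [[0.5, 0.5], [1.0, -1.0]], [[1.0, 1.0], [2.0, -2.0]]))
```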
<br />
Notice that in each of the units, two operations take place:<br />
<br />
* a linear combination of the neuron's inputs is calculated using corresponding weights<br />
<br />
* a nonlinear operation on the linear combination is performed. <br />
<br />
These two calculations are shown in Figure 2. <br />
<br />
The nonlinear function <math>\,\sigma(.)</math> is called the activation function. Activation functions, like the logistic function shown earlier, are usually continuous and have a limited range. Another common activation function used in neural networks is <math>\,\tanh(x)</math> (Figure 3).<br />
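<br />
For reference, the two activation functions mentioned here and their derivatives (which are needed later for gradient-based learning) can be written as follows in Python. The function names are hypothetical; the derivative formulas <math>\,\sigma'(a)=\sigma(a)(1-\sigma(a))</math> and <math>\,\tanh'(a)=1-\tanh^2(a)</math> are standard identities.<br />

```python
import math


def logistic(a):
    """Logistic function; its range is (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def logistic_prime(a):
    """Derivative of the logistic function: sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = logistic(a)
    return s * (1.0 - s)


def tanh_prime(a):
    """Derivative of tanh: 1 - tanh(a)^2 (equivalently 1 / cosh(a)^2)."""
    return 1.0 - math.tanh(a) ** 2


# Values and slopes at a few points; math.tanh has range (-1, 1).
for a in (-2.0, 0.0, 2.0):
    print(a, logistic(a), logistic_prime(a), math.tanh(a), tanh_prime(a))
```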
<br />
[[File:neuron2.png|300px|thumb|right|Fig.2 A general construction for a single neuron]]<br />
[[File:actfcn.png|300px|thumb|right|Fig.3 <math>tanh</math> as activation function]]<br />
<br />
The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression, and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, a threshold stage is necessary.<br />
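<br />
The notes do not specify the form of the threshold stage. One common choice, assumed here purely for illustration, is to assign the class whose output unit gives the largest target measurement; the function name <code>classify</code> is hypothetical.<br />

```python
def classify(y_hat):
    """Return the index k of the largest output y_hat_k (ties go to the smallest k)."""
    return max(range(len(y_hat)), key=lambda k: y_hat[k])


# Three output units, one per class (values made up).
print(classify([0.1, 0.7, 0.2]))
```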
<br />
====Mathematical Model of the FFNN with Multiple Hidden Layers====<br />
In the FFNN model with a single hidden layer, the derived features were represented as elements of the vector <math>\underline{z}</math>, and the original features were represented as elements of the vector <math>\underline{x}</math>. In the FFNN model with more than one hidden layer, <math>\underline{z}</math> is processed by the second hidden layer in the same way that <math>\underline{x}</math> was processed by the first hidden layer. Perceptrons in the second layer each use their own combination of weights to calculate a new set of derived features. These new derived features are processed by the third hidden layer in a similar way, and the cycle repeats for each additional hidden layer. This progression of processing is depicted in Figure 4.<br />
<br />
====Back-Propagation Learning Algorithm====<br />
<br />
[[File:bpl.png|300px|thumb|right|Fig.4 Labels for weights and derived features in the FFNN.]]<br />
<br />
Every linear-combination calculation in the FFNN involves weights that need to be set, and these weights are set using training data and an algorithm called Back-Propagation. This algorithm is similar to the gradient-descent algorithm introduced in the discussion of the Perceptron. The primary difference is that the gradient used in Back-Propagation is calculated in a more complicated way.<br />
<br />
First of all, we want to minimize the error between the estimated and true target measurements for the training data. That is, if <math>\,U</math> is the set of all weights in the FFNN, then we want to determine<br />
<br />
<math>\arg\min_U \left|y - \hat{y}\right|^2</math><br />
<br />
Now, suppose the hidden layers of the FFNN are labelled as in Figure 4. Then, we want to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the hidden layers of the FFNN. For weights <math>\,u_{jl}</math> this means we will need to find<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}}<br />
= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot<br />
\cfrac{\partial a_j}{\partial u_{jl}} = \delta_{j}z_l<br />
</math><br />
<br />
However, the closed-form solution for <math>\,\delta_{j}</math> is unknown, so we develop a recursive definition (<math>\,\delta_{j}</math> in terms of <math>\,\delta_{i}</math>):<br />
<br />
<math><br />
\delta_j = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j} <br />
= \sum_{i=1}^p \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot<br />
\cfrac{\partial a_i}{\partial a_j} <br />
= \sum_{i=1}^p \delta_i\cdot u_{ij} \cdot \sigma'(a_j)<br />
= \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}<br />
</math><br />
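<br />
To spell out the last two steps (a short worked step, assuming each unit <math>\,i</math> in the next layer combines the outputs <math>\,z_j = \sigma(a_j)</math> of the current layer with weights <math>\,u_{ij}</math>, as labelled in Figure 4):<br />
<br />
<math>a_i = \sum_j u_{ij} z_j = \sum_j u_{ij}\,\sigma(a_j) \quad\Rightarrow\quad \cfrac{\partial a_i}{\partial a_j} = u_{ij}\,\sigma'(a_j)</math><br />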
<br />
We also need to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the ''output layer'' <math>\,k</math> of the FFNN (this layer is not shown in Figure 4, but it would be the next layer to the right of the rightmost layer shown). For weights <math>\,u_{ki}</math> this means<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}}<br />
= \cfrac{\partial \left|y - \sum_m u_{km}z_m\right|^2}{\partial u_{ki}}<br />
= -2(y - \sum_m u_{km}z_m)z_i<br />
= -2(y - \hat{y})z_i<br />
</math><br />
<br />
By analogy with our computation of <math>\,\delta_j</math>, we define<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}</math><br />
<br />
However, <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}}<br />
= -2(y - \hat{y})</math><br />
<br />
Now that we have <math>\,\delta_k</math> and a recursive definition for <math>\,\delta_j</math>, it is clear that our weights can be deduced by starting from the output layer and working back through the hidden layers toward the input layer.<br />
<br />
Based on the above derivation, our algorithm for determining weights in the FFNN is as follows<br />
<br />
:1) Choose random initial weights.<br />
<br />
:2) Apply a new datapoint <math>\underline{x}</math> to the FFNN as the input layer, and calculate the values for all units.<br />
<br />
:3) Compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math>.<br />
<br />
:4) Back-propagate layer-by-layer by computing <math>\delta_j = \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}</math> for all units.<br />
<br />
:5) Compute <math>\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} = \delta_{j}z_l</math> for all weights <math>\,u_{jl}</math>.<br />
<br />
:6) Update <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}}<br />
- \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} </math> for all weights <math>\,u_{jl}</math> where <math>\,\rho</math> is the learning rate.<br />
<br />
:7) If the termination criterion has not been met, go back to step 2 and apply another datapoint (one full pass through all of the training datapoints is called an "epoch").<br />
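<br />
The steps above can be sketched in Python for the special case of a single hidden layer with logistic activations and linear output units. This is an illustrative sketch under those assumptions, not the lecture's own code; the function names and the list-of-lists weight layout (<code>U</code> for hidden-layer weights, <code>W</code> for output-layer weights) are hypothetical.<br />

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def squared_error(x, y, U, W):
    """E = |y - y_hat|^2 for a one-hidden-layer FFNN with linear outputs."""
    z = [logistic(sum(u * xl for u, xl in zip(u_j, x))) for u_j in U]
    y_hat = [sum(w * zj for w, zj in zip(w_k, z)) for w_k in W]
    return sum((yk - yh) ** 2 for yk, yh in zip(y, y_hat))


def backprop_gradients(x, y, U, W):
    """Return (dE/dU, dE/dW) for one datapoint, following steps 2 to 5."""
    # Step 2: forward pass.
    a = [sum(u * xl for u, xl in zip(u_j, x)) for u_j in U]
    z = [logistic(a_j) for a_j in a]
    y_hat = [sum(w * zj for w, zj in zip(w_k, z)) for w_k in W]
    # Step 3: output-layer deltas, delta_k = -2 (y_k - y_hat_k).
    delta_out = [-2.0 * (yk - yh) for yk, yh in zip(y, y_hat)]
    # Step 4: hidden-layer deltas, delta_j = sigma'(a_j) * sum_k delta_k * w_kj,
    # using sigma'(a_j) = z_j * (1 - z_j) for the logistic function.
    delta_hid = [z[j] * (1.0 - z[j]) * sum(delta_out[k] * W[k][j] for k in range(len(W)))
                 for j in range(len(U))]
    # Step 5: each gradient is the delta of a unit times the input feeding that weight.
    grad_W = [[dk * zj for zj in z] for dk in delta_out]
    grad_U = [[dj * xl for xl in x] for dj in delta_hid]
    return grad_U, grad_W


def sgd_step(x, y, U, W, rho=0.1):
    """Step 6: update every weight by subtracting rho times its gradient."""
    grad_U, grad_W = backprop_gradients(x, y, U, W)
    U_new = [[u - rho * g for u, g in zip(u_j, g_j)] for u_j, g_j in zip(U, grad_U)]
    W_new = [[w - rho * g for w, g in zip(w_k, g_k)] for w_k, g_k in zip(W, grad_W)]
    return U_new, W_new
```

A useful sanity check for any such implementation is to compare the back-propagated gradients with finite-difference approximations of <code>squared_error</code>.<br />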
<br />
====Alternative Description of the Back-Propagation Algorithm====<br />
Label the inputs and outputs of the <math>\,i</math>th hidden layer <math>\underline{x}_i</math> and <math>\underline{y}_i</math> respectively, and let <math>\,\sigma(.)</math> be the activation function for all neurons. We now have<br />
<br />
<math>\begin{align}<br />
\begin{cases}<br />
\underline{y}_1=\sigma(W_1.\underline{x}_1),\\<br />
\underline{y}_2=\sigma(W_2.\underline{x}_2),\\<br />
\underline{y}_3=\sigma(W_3.\underline{x}_3),<br />
\end{cases}<br />
\end{align}</math><br />
<br />
where <math>\,W_i</math> is the matrix of connection weights between layers <math>\,i</math> and <math>\,i+1</math>; it has <math>\,n_i</math> columns and <math>\,n_{i+1}</math> rows, where <math>\,n_i</math> is the number of neurons in the <math>\,i^{th}</math> layer.<br />
<br />
Using these matrix equations, one can write a closed form for the derivative of the error with respect to the weights of the network. For a neural network with two hidden layers, the equations are as follows.<br />
<br />
<math>\begin{align}<br />
\frac{\partial E}{\partial W_3}=&diag(e).\sigma'(W_3.\underline{x}_3).(\underline{x}_3)^T,\\<br />
\frac{\partial E}{\partial W_2}=&\sigma'(W_2.\underline{x}_2).(\underline{x}_2)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3\}\},\\<br />
\frac{\partial E}{\partial W_1}=&\sigma'(W_1.\underline{x}_1).(\underline{x}_1)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3.diag(\sigma'(W_2.\underline{x}_2)).W_2\}\},<br />
\end{align}</math><br />
<br />
where <math>\,\sigma'(.)</math> is the derivative of the activation function <math>\,\sigma(.)</math>.<br />
<br />
Using this closed-form derivative, it is possible to code the procedure for any number of layers and neurons. Here is Matlab code for the back-propagation algorithm. (<math>\,\tanh</math> is used as the activation function.)<br />
<br />
<br />
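 % The following appear to be assumed defined beforehand (they are not set in this listing):<br />
 %   ep - number of epochs, i - epoch counter (starting at 0), E - zeros(1,ep), ro - learning rate,<br />
 %   data - one sample per column (rows 1:f features, rows f+1:dd targets), Q - number of samples,<br />
 %   n - number of units in each layer, l - number of weight layers, w - weights (including bias terms),<br />
 %   shuffle - a user-supplied helper that permutes data along dimension 2 (not a Matlab built-in).<br />
 while i < ep<br />
     i = i + 1;<br />
     data = shuffle(data,2);                 % new random order of the samples for each epoch<br />
     for j = 1:Q<br />
         io = zeros(max(n)+1,length(n));     % column k holds the outputs of layer k, preceded by a bias entry<br />
         gp = io;                            % storage for the activation derivatives<br />
         io(1:n(1)+1,1) = [1;data(1:f,j)];   % input layer: bias entry and features of sample j<br />
         for k = 1:l                         % forward pass<br />
             io(2:n(k+1)+1,k+1) = w(2:n(k+1)+1,1:n(k)+1,k)*io(1:n(k)+1,k);   % linear combination<br />
             gp(1:n(k+1)+1,k) = [0;1./(cosh(io(2:n(k+1)+1,k+1))).^2];        % tanh'(a) = 1/cosh(a)^2<br />
             io(1:n(k+1)+1,k+1) = [1;tanh(io(2:n(k+1)+1,k+1))];              % apply tanh, prepend bias entry<br />
             wg(1:n(k+1)+1,1:n(k)+1,k) = diag(gp(1:n(k+1)+1,k))*w(1:n(k+1)+1,1:n(k)+1,k);   % weights scaled by the derivative<br />
         end<br />
         e = [0;io(2:n(l+1)+1,l+1) - data(f+1:dd,j)];   % output error (0 for the bias entry)<br />
         wg(1:n(l+1)+1,1:n(l)+1,l) = diag(e)*wg(1:n(l+1)+1,1:n(l)+1,l);<br />
         gp(1:n(l+1)+1,l) = diag(e)*gp(1:n(l+1)+1,l);<br />
         d = eye(n(l+1)+1);<br />
         E(i) = E(i) + 0.5*norm(e)^2;        % accumulate the squared error for this epoch<br />
         for k = l:-1:1                      % backward pass: update weights from the output layer to the input layer<br />
             w(1:n(k+1)+1,1:n(k)+1,k) = w(1:n(k+1)+1,1:n(k)+1,k) - ro*diag(sum(d,1))*gp(1:n(k+1)+1,k)*(io(1:n(k)+1,k)');<br />
             d = d*wg(1:n(k+1)+1,1:n(k)+1,k);   % propagate the error factor to the previous layer<br />
         end<br />
     end<br />
 end<br />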
<br />
====Some notes on the neural network and its learning algorithm====<br />
<br />
* Activation functions are usually approximately linear around the origin. If this is the case, choosing random initial weights between <math>\,-0.5</math> and <math>\,0.5</math> and normalizing the data may speed up the algorithm in the very first steps of the procedure, as the linear combination of the inputs and weights then falls within the linear region of the activation function.<br />
<br />
* Learning of the neural network using the back-propagation algorithm takes place in epochs. An epoch is a single pass through the entire training set.<br />
<br />
* It is a common practice to randomly change the permutation of the training data in each one of the epochs, to make the learning independent of the data permutation.<br />
<br />
* Given a set of data for training a neural network, one should set aside a portion of it as a validation dataset, which is used to choose a suitable number of layers and number of neurons in each layer. The best construction may be the one that leads to the least error on the validation dataset. Validation data must not be used for training the network.<br />
<br />
* We can also use the validation-training scheme to estimate how many epochs are enough for training the network.<br />
<br />
* It is also common to use other optimization algorithms, such as steepest descent and conjugate gradient, in batch form.<br />
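<br />
The notes on epochs, per-epoch shuffling, and validation-based choice of the number of epochs can be combined in a small Python sketch. The function <code>train_with_epochs</code> and its callback arguments <code>update</code> and <code>validation_error</code> are hypothetical names introduced for this example; any learning step (such as one back-propagation update) could be plugged in.<br />

```python
import random


def train_with_epochs(train_data, update, validation_error, max_epochs=100, seed=0):
    """Run max_epochs passes; return (epoch with the lowest validation error, that error).

    update(datapoint) performs one learning step on a single training datapoint.
    validation_error() returns the current error on held-out validation data.
    """
    rng = random.Random(seed)
    data = list(train_data)
    best_epoch, best_error = 0, validation_error()
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(data)      # new random permutation of the training data in every epoch
        for point in data:     # one epoch = one pass through the entire training set
            update(point)
        err = validation_error()
        if err < best_error:   # remember the epoch count that validates best
            best_epoch, best_error = epoch, err
    return best_epoch, best_error
```

The validation data are only used to evaluate the model after each epoch, never to update the weights.<br />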
<br />
=== Deep Neural Network ===<br />
In practice, back-propagation may not work well when there are too many hidden layers, since the <math>\,\delta</math> values may become negligible and the errors vanish. This is a numerical problem in which it becomes difficult to estimate the errors, so configuring a<br />
Neural Network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. A Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
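<br />
The shrinking of <math>\,\delta</math> can be illustrated with a deliberately simplified Python sketch: a chain with a single logistic unit per layer, the same weight and the same pre-activation in every layer. These simplifications and the function names are assumptions made for illustration only. Since the derivative of the logistic function is at most 0.25, each layer multiplies <math>\,\delta</math> by at most a quarter of the weight.<br />

```python
import math


def logistic_prime(a):
    """Derivative of the logistic function; its maximum value is 0.25 at a = 0."""
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)


def delta_magnitude(num_layers, a=0.0, weight=1.0, delta_out=1.0):
    """Magnitude of delta after back-propagating through a chain of single-unit layers.

    Each layer multiplies delta by weight * sigma'(a), as in the recursion for delta_j.
    """
    delta = delta_out
    for _ in range(num_layers):
        delta *= weight * logistic_prime(a)
    return abs(delta)


# The magnitude decays geometrically with the number of layers.
for layers in (1, 5, 10, 20):
    print(layers, delta_magnitude(layers))
```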
<br />
The approach to training a deep network is to assume at first that the network has only two layers and to train these two layers. After that, we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize the energy function, which is inspired by the theory in atomic physics concerning the most stable condition; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.<br />
<br />
==== Difficulties of training deep architecture <ref>{{Cite journal | title = Exploring Strategies for Training Deep Neural Networks | url = http://jmlr.csail.mit.edu/papers/volume10/larochelle09a/larochelle09a.pdf | year = 2009 | journal = Journal of Machine Learning Research | page = 1-40 | volume = 10 | last1 = Larochelle | first1 = H. | last2 = Bengio | first2 = Y. | last3 = Louradour | first3 = J. | last4 = Lamblin | first4 = P. }}</ref> ====<br />
<br />
Given a particular task, a natural way to train a deep network is to frame it as an optimization<br />
problem by specifying a supervised cost function on the output layer with respect to the desired<br />
target and use a gradient-based optimization algorithm in order to adjust the weights and biases<br />
of the network so that its output has low cost on samples in the training set. Unfortunately, deep<br />
networks trained in that manner have generally been found to perform worse than neural networks<br />
with one or two hidden layers.<br />
<br />
We discuss two hypotheses that may explain this difficulty. The first one is that gradient descent<br />
can easily get stuck in poor local minima (Auer et al., 1996) or plateaus of the non-convex training<br />
criterion. The number and quality of these local minima and plateaus (Fukumizu and Amari, 2000)<br />
clearly also influence the chances for random initialization to be in the basin of attraction (via<br />
gradient descent) of a poor solution. It may be that with more layers, the number or the width<br />
of such poor basins increases. To reduce the difficulty, it has been suggested to train a neural<br />
network in a constructive manner in order to divide the hard optimization problem into several<br />
greedy but simpler ones, either by adding one neuron (e.g., see Fahlman and Lebiere, 1990) or one<br />
layer (e.g., see Lengellé and Denoeux, 1996) at a time. These two approaches have demonstrated to<br />
be very effective for learning particularly complex functions, such as a very non-linear classification<br />
problem in 2 dimensions. However, these are exceptionally hard problems, and for learning tasks<br />
usually found in practice, this approach commonly overfits.<br />
<br />
This observation leads to a second hypothesis. For high capacity and highly flexible deep networks,<br />
there actually exists many basins of attraction in its parameter space (i.e., yielding different<br />
solutions with gradient descent) that can give low training error but that can have very different generalization<br />
errors. So even when gradient descent is able to find a (possibly local) good minimum<br />
in terms of training error, there are no guarantees that the associated parameter configuration will<br />
provide good generalization. Of course, model selection (e.g., by cross-validation) will partly correct<br />
this issue, but if the number of good generalization configurations is very small in comparison<br />
to good training configurations, as seems to be the case in practice, then it is likely that the training<br />
procedure will not find any of them. But, as we will see in this paper, it appears that the type of<br />
unsupervised initialization discussed here can help to select basins of attraction (for the supervised<br />
fine-tuning optimization phase) from which learning good solutions is easier both from the point of<br />
view of the training set and of a test set.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When neural networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". We now know that they are just logistic regression layers stacked on top of each other, and that they have nothing to do with the actual working principles of the brain.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this claim is unfounded. Because of claims of this kind, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since the objective function is not convex, we still face the problem of local minima, although people have devised techniques to mitigate this problem.<br />
<br />
In sum, neural networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. Although they are not an active research area in machine learning, neural networks still have wide applications in engineering fields such as control.<br />
<br />
===Business Applications of Neural Networks===<br />
<br />
Neural networks are increasingly being used in real-world business applications and, in some cases, such as fraud detection, they have already become the method of choice. Their use for risk assessment is also growing, and they have been employed to visualize complex databases for marketing segmentation. Their applications cover a wide range of business interests, from finance management, through forecasting, to production. The combination of statistical, neural and fuzzy methods now enables direct quantitative studies to be carried out without the need for highly specialized expertise.<br />
<br />
* On the Use of Neural Networks for Analysis Travel Preference Data <br />
* Extracting Rules Concerning Market Segmentation from Artificial Neural Networks <br />
* Characterization and Segmenting the Business-to-Consumer E-Commerce Market Using Neural Networks<br />
* A Neurofuzzy Model for Predicting Business Bankruptcy <br />
* Neural Networks for Analysis of Financial Statements <br />
* Developments in Accurate Consumer Risk Assessment Technology <br />
* Strategies for Exploiting Neural Networks in Retail Finance <br />
* Novel Techniques for Profiling and Fraud Detection in Mobile Telecommunications<br />
* Detecting Payment Card Fraud with Neural Networks<br />
* Money Laundering Detection with a Neural-Network <br />
* Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=7363stat841f102010-10-25T20:53:23Z<p>Hclam: /* Multi-Class Logistic Regression & Perceptron - October 19, 2010 */</p>
<hr />
<div>==[[Proposal Fall 2010]] ==<br />
==[[statf10841Scribe|Editor sign up]] ==<br />
{{Cleanup|date=October 8 2010|reason=Provide a summary for each topic here.}}<br />
== Summary ==<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
=== Principal Component Analysis ===<br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis] method. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA works by using a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data to project the data onto these directions so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation of our data is usually much easier to visualize and it also exhibits the most informative aspects (dimensions) of our data whilst capturing as much of the variation exhibited by our data as it possibly could.<br />
<br />
==[[f10_Stat841_digest |Digest ]] ==<br />
<br />
== ''' Reference Textbook''' ==<br />
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
== ''' Classification - September 21, 2010''' ==<br />
<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction] (feature extraction or manifold learning). Note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.<br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found [http://www.e-knowledge.ca/quotes.php?topic=Knowledge here].<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers <br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the <math>\,i</math>th training input's feature values <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math> form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which can take a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
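<br />
As a rough illustration of the definition above, the following Python sketch counts the training inputs that a rule misclassifies and divides by <math>\,n</math>. The function names, the threshold rule, and the toy data are hypothetical and are not taken from the lecture.<br />
<br />
```python
def empirical_error_rate(h, xs, ys):
    """Return the fraction of training pairs (x, y) with h(x) != y."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    # Each misclassified input contributes 1 to the sum (the indicator I).
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


def threshold_rule(x):
    """Hypothetical rule: predict class 1 when the single feature exceeds 0.5."""
    return 1 if x > 0.5 else 0


# Toy training set (assumed values): one feature per input, labels in {0, 1}.
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]

error = empirical_error_rate(threshold_rule, xs, ys)
print(error)
```
<br />
Only the second input (feature value 0.4, true label 1) is misclassified by this rule, so one of the four training inputs counts toward the sum.<br />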
<br />
<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify any new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
<br />
In practice, the empirical error rate is used to estimate the true error rate, whose value cannot be known exactly because the parameter values of the underlying process cannot be known but can only be estimated using available data. The empirical error rate, in practice, estimates the true error rate quite well in that, as mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], it is an unbiased estimator of the true error rate when the classification rule <math>\,h</math> is fixed independently of the data on which the error is measured.<br />
<br />
=== Bayes Classifier ===<br />
<br />
{{Cleanup|date=October 14 2010|reason=In response to the previous tag: The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function <math>\,f_y(x)</math> in the equation:<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
The simpler form of the likelihood function seen in the naive Bayes is:<br />
:<math><br />
\begin{align}<br />
f_y(x) = P(X=x|Y=y) = {\prod_{i=1}^{d} P(X_{i}=x_{i}|Y=y)}<br />
\end{align}<br />
</math><br />
The Bayes classifier taught in class was not the naive Bayes classifier. Perhaps a comment should be made about the naive Bayes classifier in the body of the text}}<br />
<br />
<br />
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". It is a special, simpler case of the general Bayes classifier described later in this section, which does not require the independence assumption.<br />
<br />
In simple terms, a naive Bayes classifier assumes that, given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for naive Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests][2].<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
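<br />
The point about needing only per-class means and variances can be sketched in Python as follows. This is a minimal illustration that assumes each feature is Gaussian within each class (an assumption made here for the sketch); the function names and the toy data are hypothetical.<br />
<br />
```python
import math


def fit_gaussian_naive_bayes(X, y):
    """Estimate, for each class, its prior and per-feature means and variances."""
    model = {}
    n = len(y)
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        d = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        # Only d variances per class are needed, not a full covariance matrix.
        # A zero variance would break the density below; real code should guard it.
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) for j in range(d)
        ]
        model[label] = (len(rows) / n, means, variances)
    return model


def predict(model, x):
    """Return the class with the largest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence assumption: the joint likelihood is a product over
        # features, which becomes a sum of per-feature log densities.
        for xj, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data (assumed values): two features, two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_naive_bayes(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 0.4]))
```
<br />
Working with log probabilities avoids numerical underflow when many per-feature likelihoods are multiplied together.<br />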
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class <math> y \in \mathcal{Y} </math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h^*</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification for any given new data input, i.e., for any generic classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. Estimating all of these probability functions (the likelihood, the prior probability, and the evidence) can also be computationally very expensive, which makes some other classifiers more favorable than the Bayes classifier in practice.<br />
<br />
A detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h^*</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h^*)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
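<br />
Approach 3 can be sketched in Python for a single real-valued feature. The sketch assumes, purely for illustration, that each class-conditional density is Gaussian; the approach itself does not prescribe a density family. All names and data values are hypothetical.<br />
<br />
```python
import math


def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a one-dimensional sample."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_density_classifier(xs, ys):
    """Estimate P(X=x|Y=0), P(X=x|Y=1) and P(Y=1), then build r_hat and h."""
    params0 = fit_gaussian([x for x, y in zip(xs, ys) if y == 0])
    params1 = fit_gaussian([x for x, y in zip(xs, ys) if y == 1])
    prior1 = sum(ys) / len(ys)  # estimate of P(Y = 1) as the mean of the labels

    def r_hat(x):
        # Bayes formula with the estimated densities and priors.
        # For x very far from both classes the denominator can underflow to 0.
        numerator = gaussian_pdf(x, *params1) * prior1
        denominator = numerator + gaussian_pdf(x, *params0) * (1 - prior1)
        return numerator / denominator

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h


# Toy training data (assumed values).
xs = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 1, 1]

r_hat, h = fit_density_classifier(xs, ys)
print(h(1.2), h(4.8), r_hat(3.0))
```
<br />
With these symmetric toy data the two estimated densities have equal variance and the priors are equal, so the estimated posterior at the midpoint between the class means is one half.<br />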
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
'''Theorem'''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
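<br />
The theorem can be illustrated with a short Python sketch that turns likelihood values <math>\,f_i(x)</math> and priors <math>\,\pi_i</math> into posteriors and then takes the arg max. The function names and the numerical values are hypothetical and serve only to show the mechanics.<br />
<br />
```python
def posteriors(likelihoods, priors):
    """Apply the Bayes formula: posterior_i = f_i(x) * pi_i / sum_j f_j(x) * pi_j."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())
    if evidence <= 0:
        raise ValueError("the evidence P(X=x) must be positive")
    return {c: value / evidence for c, value in joint.items()}


def bayes_classify(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Hypothetical values of f_i(x) at one fixed input x, and priors pi_i, for k = 3.
likelihoods = {1: 0.10, 2: 0.40, 3: 0.20}
priors = {1: 0.5, 2: 0.2, 3: 0.3}

label = bayes_classify(likelihoods, priors)
print(label, posteriors(likelihoods, priors))
```
<br />
Because the evidence is the same for every class, the arg max of the posteriors equals the arg max of the products <math>\,f_i(x)\pi_i</math>; the normalization is only needed when the posterior values themselves are of interest.<br />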
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course. ''Note: these are the known y values in the training data.'' <br />
<br />
These known data are summarized in the following tables:<br />
<br />
:[[File:裁剪.jpg]]<br />
{{Cleanup|date=September 2010|reason=this graph is not complete, the reason is that it should be in consistent with the computation below.}}<br />
<br />
For each student, his or her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.05 \times 0.5+0.2 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
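<br />
The arithmetic in this example can be checked with a few lines of Python, using the two likelihood values (0.05 for ''pass'' and 0.2 for ''fail'') and the equal priors that appear in the computation above. The function name is hypothetical.<br />
<br />
```python
def posterior_pass(lik_pass, lik_fail, prior_pass):
    """Posterior P(Y=1 | X=x) from the two likelihoods and the prior of class 1."""
    prior_fail = 1.0 - prior_pass
    numerator = lik_pass * prior_pass                 # 0.05 * 0.5 = 0.025
    evidence = numerator + lik_fail * prior_fail      # 0.025 + 0.2 * 0.5 = 0.125
    return numerator / evidence


r_hat = posterior_pass(lik_pass=0.05, lik_fail=0.2, prior_pass=0.5)

# Classification rule: predict "pass" (1) only when the posterior exceeds 1/2.
label = 1 if r_hat > 0.5 else 0
print(r_hat, label)
```
<br />
The resulting posterior is below one half, so the rule returns the label 0 (''fail''), in agreement with the conclusion above.<br />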
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
<br />
The Bayesian view of probability states that any event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how strongly one believes that event E will occur before knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the more useful information is available regarding events that are relevant to event E, the more accurate the estimate of the probability of event E's occurrence becomes. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] in the linear case) that determines the class of any new data input depending on which side of the surface the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the priors and the class-conditional densities are not known. Under LDA, one gets around this problem by assuming that both classes have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distributions] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the dimension of <math>\,x</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> to <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of <math>\,D(h^*)</math> is as follows:<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out alike terms and factoring).<br />
<br />
It is easy to see that, under LDA, the Bayes classifier's decision boundary <math>\,D(h^*)</math> has the form <math>\,a^\top x+b=0</math>, with <math>\,a = \Sigma^{-1}(\mu_1-\mu_0)</math>, and it is linear in <math>\,x</math>. This is where the word ''linear'' in linear discriminant analysis comes from.<br />
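<br />
As a numerical check of the last line of the derivation, the following sketch reads off the coefficients of the linear boundary and evaluates it at the midpoint of the two means. All parameter values are hypothetical and chosen only for illustration; the names <code>a</code> and <code>b</code> are the ones used for the linear form above.<br />
<br />
```matlab
% Sketch: coefficients of the LDA boundary a'*x + b = 0 for two classes.
% All parameter values below are hypothetical.
mu1 = [2; 0];
mu0 = [0; 0];
Sigma = [1 0.5; 0.5 2];          % common covariance matrix
pi1 = 0.5;
pi0 = 0.5;

a = Sigma \ (mu1 - mu0);         % Sigma^{-1} (mu_1 - mu_0)
b = log(pi1 / pi0) - 0.5 * (mu1' * (Sigma \ mu1) - mu0' * (Sigma \ mu0));

% With equal priors, the midpoint of the means lies on the boundary.
x_mid = (mu1 + mu0) / 2;
boundary_value = a' * x_mid + b;
disp(boundary_value);            % zero up to rounding error
```
<br />
Changing <code>pi1</code> and <code>pi0</code> to unequal values shifts the boundary away from the midpoint, toward the mean of the less probable class.<br />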
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>; then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0</math>. In addition, if the two classes <math>\,m </math> and <math>\,n</math> have equal priors, <math>\,\pi_m=\pi_n</math> (for instance, when the priors are estimated from classes containing the same number of data points), then the <math>\,\log</math> term vanishes and the resulting decision boundary passes through the point exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
According to [http://www.lsv.uni-saarland.de/Vorlesung/Digital_Signal_Processing/Summer06/dsp06_chap9.pdf this link], some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that the data in each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features that best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
<br />
The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-classes case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows: <br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math> (by cancellation)<br />
:<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)</math> (by taking the log of both sides)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0</math> (by expanding out)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0</math> <br />
<br />
It is easy to see that, under QDA, the decision boundary <math>\,D(h^*)</math> has the form <math>\,x^\top Ax+b^\top x+c=0</math> and it is quadratic in <math>\,x</math>. This is where the word ''quadratic'' in quadratic discriminant analysis comes from.<br />
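<br />
The quadratic, linear and constant parts of the boundary can be read off the last line of the derivation. The sketch below does this for hypothetical parameter values and checks the result against the difference of the two log-posterior terms at an arbitrary test point; the names <code>A</code>, <code>b</code>, <code>c</code> and <code>g</code> are illustrative only.<br />
<br />
```matlab
% Sketch: coefficients of the QDA boundary x'*A*x + b'*x + c = 0.
% All parameter values below are hypothetical.
mu1 = [2; 0];   Sigma1 = [1 0; 0 1];
mu0 = [0; 0];   Sigma0 = [3 0.5; 0.5 1];
pi1 = 0.4;      pi0 = 0.6;

A = -0.5 * (inv(Sigma1) - inv(Sigma0));
b = Sigma1 \ mu1 - Sigma0 \ mu0;
c = log(pi1 / pi0) - 0.5 * log(det(Sigma1) / det(Sigma0)) ...
    - 0.5 * (mu1' * (Sigma1 \ mu1) - mu0' * (Sigma0 \ mu0));

% Log of f_k(x)*pi_k with the shared (2*pi)^(d/2) factor dropped.
g = @(z, mu, S, p) -0.5 * log(det(S)) - 0.5 * (z - mu)' * (S \ (z - mu)) + log(p);

x = [1; -1];                                   % arbitrary test point
lhs = x' * A * x + b' * x + c;
rhs = g(x, mu1, Sigma1, pi1) - g(x, mu0, Sigma0, pi0);
disp(abs(lhs - rhs));                          % zero up to rounding error
```
<br />
When <code>Sigma1</code> equals <code>Sigma0</code>, the matrix <code>A</code> is zero and the boundary reduces to the linear one obtained under LDA.<br />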
<br />
As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0</math>.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
* In the case of LDA, which assumes that a common covariance matrix is shared by all classes, <math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is linear in <math>\,x</math>.<br />
<br />
* In the case of QDA, which assumes that each class has its own covariance matrix, <math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is quadratic in <math>\,x</math>.<br />
<br />
<br />
'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]<br />
<br />
===In practice===<br />
The true parameter values are unknown, so we replace <math>\,\pi_k,\mu_k,\Sigma_k</math> with their maximum likelihood estimates from the sample, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
The common covariance matrix, denoted <math>\Sigma</math>, is estimated by pooling the covariance estimates of the individual classes, each weighted by its class size. <br />
<br />
In the case where we need a common covariance matrix (as in LDA), we get the estimate using the following equation:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{K}(n_r\hat{\Sigma_r})}{n-K} </math><br />
<br />
where <math>\,n_r</math> is the number of data points in class r, <math>\,\hat{\Sigma_r}</math> is the estimated covariance of class r, <math>\,n</math> is the total number of data points, and<br />
<math>\,K</math> is the number of classes. Dividing by <math>\,n-K</math> gives an unbiased estimate; dividing by <math>\,n</math> instead gives the maximum likelihood estimate.<br />
<br />
See the details about the [http://en.wikipedia.org/wiki/Estimation_of_covariance_matrices estimation of covariance matrices].<br />
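<br />
The estimates above can be computed directly from labeled data. The following is a minimal sketch on a hypothetical toy data set; the variable names (<code>pi_hat</code>, <code>mu_hat</code>, <code>Sigma_hat</code>, <code>S_pooled</code>) are illustrative and are not taken from the course code.<br />
<br />
```matlab
% Sketch: sample estimates of the priors, means and covariances.
% Hypothetical toy data: n = 6 points, d = 2 dimensions, K = 2 classes.
X = [1 2; 2 1; 3 3; 6 5; 7 7; 8 6];     % one row per data point
y = [1; 1; 1; 2; 2; 2];                 % class label of each row
[n, d] = size(X);
K = max(y);

pi_hat = zeros(K, 1);
mu_hat = zeros(K, d);
Sigma_hat = zeros(d, d, K);
S_pooled = zeros(d, d);
for k = 1:K
    Xk = X(y == k, :);                        % data of class k
    nk = size(Xk, 1);
    pi_hat(k) = nk / n;                       % n_k / n
    mu_hat(k, :) = mean(Xk, 1);               % class mean
    Xc = Xk - repmat(mu_hat(k, :), nk, 1);    % centered class data
    Sigma_hat(:, :, k) = (Xc' * Xc) / nk;     % divides by n_k, as above
    S_pooled = S_pooled + nk * Sigma_hat(:, :, k);
end
S_pooled = S_pooled / (n - K);                % common covariance for LDA
```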
<br />
===Computation For QDA And LDA===<br />
<br />
First, let us consider QDA, and examine each of the following two cases.<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
<math>\, \Sigma_k = I </math> for every class <math>\,k</math> implies that our data is spherical. This means that the data of each class <math>\,k</math> is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,-\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class that minimizes <math>\,\frac{1}{2}\|x-\mu_k\|^2 - \log(\pi_k)</math> maximizes <math>\,\delta_k</math>; when the priors are equal, this is simply the class with the nearest center. According to the theorem, we then assign the point to that class. <br />
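<br />
The role of the prior adjustment in Case 1 can be seen in a small sketch. The centers, priors and test point below are hypothetical; they are chosen so that the point is nearer to the second center but is still assigned to the first class because of its much larger prior.<br />
<br />
```matlab
% Sketch: Case 1 (Sigma_k = I), nearest center adjusted by the log prior.
% Centers, priors and the test point are hypothetical.
mu = [0 0; 3 0];            % one class center per row
prior = [0.9; 0.1];
x = [1.6 0];                % point to classify

K = size(mu, 1);
delta = zeros(K, 1);
for k = 1:K
    diff_k = x - mu(k, :);
    delta(k) = -0.5 * (diff_k * diff_k') + log(prior(k));
end
[~, k_hat] = max(delta);    % class with the largest delta_k
disp(k_hat);
```
<br />
With equal priors the same point would go to the class with the nearer center.<br />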
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top </math> (In general, when a matrix <math>\,A</math> has the singular value decomposition <math>\,A=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,AA^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,A^\top A</math>. <br />
So if <math>\, A</math> is symmetric and positive semi-definite, we can take <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric and positive semi-definite because it is the covariance matrix of <math> X_k </math>), and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top </math> (since <math>\,U_k</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S_k^{-\frac{1}{2}}U_k^\top x </math> and <math>\, S_k^{-\frac{1}{2}}U_k^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>.<br />
<br />
A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math> where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.<br />
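<br />
The identity derived above can be checked numerically. The sketch below uses a hypothetical covariance matrix, center and data point, and compares the quadratic form <math>\,(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)</math> with the squared Euclidean distance between the transformed point and the transformed center.<br />
<br />
```matlab
% Sketch: the quadratic form equals a squared Euclidean distance after
% the transformation S^(-1/2) * U'. All values are hypothetical.
Sigma = [4 1; 1 3];                     % symmetric positive definite
mu = [1; 2];
x = [3; -1];

[U, S, ~] = svd(Sigma);                 % Sigma = U*S*U' in this case
W = diag(1 ./ sqrt(diag(S))) * U';      % W = S^(-1/2) * U'
x_star = W * x;
mu_star = W * mu;

d_quad = (x - mu)' * (Sigma \ (x - mu));     % original quadratic form
d_eucl = sum((x_star - mu_star) .^ 2);       % distance after transforming
disp([d_quad d_eucl]);                       % the two values agree
```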
<br />
It is now possible to compute the quadratic term of <math>\,\delta_k</math> with <math>\,x^*</math> and <math>\,\mu_k^*</math>, treating them as in Case 1 above, because after the transformation the data of class <math>\,k</math> has covariance <math>\,I</math>.<br />
<br />
Note that, unlike in Case 1, the term <math>\,-\frac{1}{2}\log(|\Sigma_k|)</math> is not zero in general. When the classes have different covariance matrices, we therefore also need to compute <math>\, \log(|\Sigma_k|)</math> for each class before computing <math> \,\delta_k </math> for QDA.<br />
<br />
If all classes have the same shape, in other words the same covariance matrix, as in LDA, then they all share the same transformation: every data point is transformed once and compared with all of the transformed centers.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, no single transformation makes all of the classes spherical at once, and choosing the transformation of one class ahead of time would amount to assuming that the data point belongs to that class. Instead, the transformation of each class is applied in turn: for a given data point, we apply the transformation of class <math>\,k</math> to the point and to <math>\,\mu_k</math>, compute <math>\,\delta_k</math>, and repeat this for every class. For two classes, this gives <math>\,\delta_1</math> and <math>\,\delta_2</math>, and we determine which class the data point belongs to by comparing <math> \,\delta_1 </math> and <math> \,\delta_2 </math>.<br />
<br />
In summary, to apply QDA on a data set <math>\,X</math>, in the general case where <math>\, \Sigma_k \ne I </math> for each class <math>\,k</math>, one can proceed as follows:<br />
<br />
:: Step 1: For each class <math>\,k</math>, apply singular value decomposition to the estimated covariance matrix <math>\,\hat{\Sigma_k}</math> of that class to obtain <math>\,S_k</math> and <math>\,U_k</math>.<br />
<br />
:: Step 2: For each data point <math>\,x</math> and each class <math>\,k</math>, transform <math>\,x</math> to <math>\,x^* = S_k^{-\frac{1}{2}}U_k^\top x</math>, and transform the center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math> and each class <math>\,k</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu_k^*</math>, compute <math>\,\delta_k = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}\|x^*-\mu_k^*\|^2 + \log(\pi_k)</math>, and assign <math>\,x</math> to the class with the largest <math>\,\delta_k</math>. If <math>\,\log(|\Sigma_k|)</math> and <math>\,\pi_k</math> are the same for all classes, this is the class for which the squared Euclidean distance between <math>\,x^*</math> and <math>\,\mu_k^*</math> is the least.<br />
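<br />
The steps above can be sketched as follows for a single data point. This is only an illustration under stated assumptions: the class means, covariance matrices and priors are hypothetical values taken as already estimated, and the sketch includes the <math>\,\log(|\Sigma_k|)</math> and <math>\,\log(\pi_k)</math> terms of <math>\,\delta_k</math> from the theorem.<br />
<br />
```matlab
% Sketch: QDA for one point using a per-class transformation.
% Means, covariances and priors are hypothetical, assumed already estimated.
mu = {[0; 0], [3; 3]};
Sigma = {[1 0; 0 1], [4 1; 1 2]};
prior = [0.5, 0.5];
x = [2.5; 2.5];                              % point to classify

K = numel(mu);
delta = zeros(K, 1);
for k = 1:K
    [U, S, ~] = svd(Sigma{k});               % Step 1: decompose Sigma_k
    W = diag(1 ./ sqrt(diag(S))) * U';       % S_k^(-1/2) * U_k'
    dist2 = sum((W * x - W * mu{k}) .^ 2);   % Step 2: transform, then distance
    logdet = sum(log(diag(S)));              % log|Sigma_k| from singular values
    delta(k) = -0.5 * logdet - 0.5 * dist2 + log(prior(k));
end
[~, k_hat] = max(delta);                     % Step 3: largest delta_k wins
disp(k_hat);
```
<br />
Each class uses its own transformation, so no class membership has to be assumed before the point is classified.<br />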
<br />
<br />
Now, let us consider LDA. <br />
Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common covariance matrix across the classes, so we perform the singular value decomposition once, as opposed to k times.<br />
<br />
To apply LDA on a data set <math>\,X</math>, one can proceed as follows:<br />
<br />
:: Step 1: Apply singular value decomposition to the estimated common covariance matrix <math>\,\hat{\Sigma}</math> to obtain <math>\,S</math> and <math>\,U</math>.<br />
<br />
:: Step 2: For each <math>\,x \in X</math>, transform <math>\,x</math> to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, and transform each center <math>\,\mu</math> to <math>\,\mu^* = S^{-\frac{1}{2}}U^\top \mu</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu^*</math> of each class, and assign <math>\,x</math> to the class that maximizes <math>\,-\frac{1}{2}\|x^*-\mu^*\|^2 + \log(\pi_k)</math>. With equal priors, this is the class for which the squared Euclidean distance between <math>\,x^*</math> and <math>\,\mu^*</math> is the least over all of the classes.<br />
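<br />
A sketch of the LDA steps for a single data point and three classes follows. The centers, the common covariance matrix and the priors are hypothetical values assumed to have been estimated beforehand; the log prior is included in the score so that unequal priors would also be handled.<br />
<br />
```matlab
% Sketch: LDA for one point using a single shared transformation.
% Centers, common covariance and priors are hypothetical.
mu = [0 0; 4 0; 0 4];                   % one class center per row
Sigma = [2 0.5; 0.5 1];                 % common covariance matrix
prior = [1/3; 1/3; 1/3];
x = [3 1];                              % point to classify

[U, S, ~] = svd(Sigma);                 % Step 1: done once for all classes
W = diag(1 ./ sqrt(diag(S))) * U';      % S^(-1/2) * U'
x_star = (W * x')';                     % Step 2: transform the point
mu_star = (W * mu')';                   %         and every center

K = size(mu, 1);
score = zeros(K, 1);
for k = 1:K                             % Step 3: distance plus log prior
    score(k) = -0.5 * sum((x_star - mu_star(k, :)) .^ 2) + log(prior(k));
end
[~, k_hat] = max(score);
disp(k_hat);
```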
<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In actual data scenarios, QDA often provides a better classifier for the data than LDA when enough data are available, because QDA does not assume that the covariance matrix for each class is identical, as LDA assumes. However, QDA still assumes that the class conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method which does not have the Gaussian distribution assumption.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more parameters we have to estimate, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
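<br />
The two counts can be compared for a few dimensions. The number of classes and the dimensions below are hypothetical values chosen for illustration.<br />
<br />
```matlab
% Sketch: number of parameters for LDA and QDA decision boundaries.
K = 3;                                         % hypothetical number of classes
d = [2 10 64];                                 % hypothetical dimensions
n_lda = (K - 1) .* (d + 1);                    % (K-1)(d+1)
n_qda = (K - 1) .* (d .* (d + 3) ./ 2 + 1);    % (K-1)(d(d+3)/2 + 1)
disp([d' n_lda' n_qda']);                      % columns: d, LDA, QDA
```
<br />
The LDA count grows linearly in <math>\,d</math>, while the QDA count grows quadratically.<br />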
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA - September 28, 2010==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
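where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is taken here to be a diagonal <math>d \times d</math> matrix with diagonal entries <math>\,v_1,\dots,v_d</math>, that we cannot estimate directly with a linear method.<br />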
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared (elementwise), to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension. Note, however, that this is not the same as doing QDA: if we apply QDA directly to the original data, the resulting decision boundary will generally be different. Here we use LDA on the extended data to find a nonlinear boundary that may separate the classes better than a linear one, so we can call this method nonlinear LDA.<br />
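<br />
The equivalence between a linear function of the extended vector and a quadratic function of the original vector can be verified on a small example. The weights and the data point below are hypothetical, and the quadratic part is assumed to contain squared terms only (no cross terms), which matches the extended vector defined above.<br />
<br />
```matlab
% Sketch: a linear function of [x; x.^2] is a quadratic function of x.
% All values are hypothetical.
x = [2; -1];                     % data point, d = 2
w = [0.5; 1.5];                  % weights on the original features
v = [2; -3];                     % weights on the squared features

w_star = [w; v];                 % extended weight vector
x_star = [x; x .^ 2];            % extended data point

g_star = w_star' * x_star;                 % linear in x_star
g_quad = x' * diag(v) * x + w' * x;        % quadratic in x, no cross terms
disp([g_star g_quad]);                     % both values agree
```
<br />
Cross terms such as <math>\,x_1x_2</math> could be handled in the same way by appending them as further features.<br />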
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
 >> % Columns 1-2 hold the PCA scores; columns 3-4 hold their squares.<br />
 >> X_star = [sample, sample.^2];<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA to the same data set and reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the classes of the points 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main differences with QDA are a slightly different call to <code>classify</code> and a more complicated procedure to plot the decision boundary, which is now a quadratic curve rather than a line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA that do not lie on the correct side of the line produced by LDA.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function <code>princomp</code> in Matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] that performs PCA conveniently; the details of this function can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should note the following points when comparing <code>princomp</code> with the SVD method:<br />
<br />
First, the rows of <math>\,X</math> correspond to observations and the columns to variables. When using <code>princomp</code> on the 2_3 data from assignment 1, we therefore pass the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, <code>princomp</code> centers <math>\,X</math> by subtracting off the column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients of the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA with the SVD method and with <code>princomp</code>, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U, up to the sign of each column.<br />
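<br />
This can be checked numerically. The lines below are a sketch, assuming that <code>y</code>, <code>v</code>, <code>score</code> and <code>U</code> from the two runs above are still in the workspace. Singular vectors are determined only up to sign, so the comparison uses absolute values, and it is restricted to the first two components (an arbitrary choice), where the singular values are expected to be distinct and nonzero.<br />
<br />
>> max(max(abs(abs(y(:,1:2)) - abs(score(:,1:2))))) % close to 0 if the projections agree up to sign<br />
>> max(max(abs(abs(v(:,1:2)) - abs(U(:,1:2))))) % close to 0 if the directions agree up to sign<br />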
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Principal Component Analysis - September 30, 2010==<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br />
<br /><br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA projects the data onto a user-defined number of its most important directions of variation (dimensions or '''principal components''') so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of our data whilst capturing as much of the variation in the data as possible. <br />
<br />
<br />
Furthermore, if one considers the lower dimensional representation produced by PCA as a least squares fit of our original data, then it can also be easily shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know which dimensions PCA selects to be the most informative dimensions in order to create the lower-dimensional representation. <br />
<br />
<br />
Suppose <math>\,X</math> is our data matrix containing <math>\,d</math>-dimensional data. The idea behind PCA is to apply [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] to <math>\,X</math> to replace the <math>\,d</math> original dimensions of <math>\,X</math> by a smaller number of directions that capture as much of the [http://en.wikipedia.org/wiki/Variance variance] in <math>\,X</math> as possible. First, through the application of singular value decomposition to <math>\,X</math>, PCA obtains all of our data's directions of variation. These directions are ordered from left to right, with the leftmost directions capturing the most variation in our data and the rightmost directions capturing the least. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space. <br />
<br />
<br />
By applying singular value decomposition to <math>\,X</math>, <math>\,X</math> is decomposed as <math>\,X = U\Sigma V^T \,</math>. Suppose <math>\,X</math> is <math>\,d \times n</math>, with one data point per column and <math>\,d \leq n</math>. The <math>\,d</math> columns of <math>\,U</math> are the [http://en.wikipedia.org/wiki/Eigenvector eigenvectors] of <math>\,XX^T \,</math>.<br />
The <math>\,n</math> columns of <math>\,V</math> are the eigenvectors of <math>\,X^TX \,</math>. The <math>\,d</math> diagonal values of <math>\,\Sigma</math> are the square roots of the [http://en.wikipedia.org/wiki/Eigenvalue eigenvalues] of <math>\,XX^T \,</math> (which are also the <math>\,d</math> largest eigenvalues of <math>\,X^TX \,</math>), and they correspond to the columns of <math>\,U</math> (and to the first <math>\,d</math> columns of <math>\,V</math>). <br />
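<br />
These relationships are easy to check numerically. The following is a minimal sketch on random data; the sizes and variable names are hypothetical and chosen only for illustration.<br />
<br />
X = randn(3,50); % hypothetical data: d = 3 dimensions, n = 50 points<br />
[U,D,V] = svd(X); % U is 3-by-3, D is 3-by-50, V is 50-by-50<br />
sv = diag(D); % the 3 singular values, largest first<br />
ev = sort(eig(X*X'),'descend'); % eigenvalues of XX^T, largest first<br />
max(abs(sv.^2 - ev)) % close to 0: squared singular values are the eigenvalues<br />
norm(X*X'*U - U*diag(sv.^2)) % close to 0: columns of U are eigenvectors of XX^T<br />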
<br />
<br />
We are interested in <math>\,U</math>, whose <math>\,d</math> columns are the <math>\,d</math> directions of variation of our data. Ordered from left to right, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most informative direction of variation of our data. That is, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most effective column in terms of capturing the total variance exhibited by our data. A subset of the columns of <math>\,U</math> is used by PCA to reduce the dimensionality of <math>\,X</math> by projecting <math>\,X</math> onto the columns of this subset. In practice, when we apply PCA to <math>\,X</math> to reduce the dimensionality of <math>\,X</math> from <math>\,d</math> to <math>\,k</math>, where <math>k < d\,</math>, we would proceed as follows:<br />
<br />
:: Step 1: Center <math>\,X</math> so that it would have zero mean.<br />
<br />
:: Step 2: Apply singular value decomposition to <math>\,X</math> to obtain <math>\,U</math>.<br />
<br />
:: Step 3: Suppose we denote the resulting <math>\,k</math>-dimensional representation of <math>\,X</math> by <math>\,Y</math>. Then, <math>\,Y</math> is obtained as <math>\,Y = U_k^TX</math>. Here, <math>\,U_k</math> consists of the first (leftmost) <math>\,k</math> columns of <math>\,U</math> that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.<br />
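<br />
The three steps above can be sketched in Matlab as follows. The data and sizes are hypothetical; only the sequence of operations is taken from the steps.<br />
<br />
d = 5; n = 200; k = 2; % hypothetical sizes<br />
X = randn(d,n); % hypothetical data, one column per data point<br />
Xc = X - repmat(mean(X,2),1,n); % Step 1: subtract the mean of each row (variable)<br />
[U,D,V] = svd(Xc,'econ'); % Step 2: singular value decomposition of the centered data<br />
Uk = U(:,1:k); % the k leftmost columns of U<br />
Y = Uk'*Xc; % Step 3: k-by-n lower-dimensional representation<br />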
<br />
<br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second Principal Component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:Handwritten 3s.gif]]<br />
<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,130}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[Image:23plotPCA.jpg]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
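<br />
A minimal sketch of this two-coefficient approximation is given below. The matrix <code>X</code> is a random placeholder with the same 644-by-130 shape as the data set of threes, so the numbers themselves carry no meaning; the variable names are chosen to match the notation above.<br />
<br />
X = rand(644,130); % placeholder for the 644-by-130 matrix of images<br />
xbar = mean(X,2); % mean image (the constant term)<br />
Xc = X - repmat(xbar,1,130); % centered data<br />
[U,D,Z] = svd(Xc,'econ');<br />
V = U(:,1:2); % 644-by-2: the first two principal components<br />
W = V'*Xc; % 2-by-130: two coefficients per image<br />
Xhat = repmat(xbar,1,130) + V*W; % each column is xbar + lambda_1*v_1 + lambda_2*v_2<br />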
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\begin{align}\textbf{w}\end{align}</math> be an arbitrary direction, <math>\begin{align}\textbf{x}\end{align}</math> a data point and <math>\begin{align}\displaystyle u\end{align}</math> the length of the projection of <math>\begin{align}\textbf{x}\end{align}</math> in direction <math>\begin{align}\textbf{w}\end{align}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\begin{align}\textbf{w}\end{align}</math> is the same as <math>\begin{align}c\textbf{w}\end{align}</math>, for any positive scalar <math>c</math>, so without loss of generality, we assume that: <br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}.<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}. <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} .</math><br />
<br /><br /><br />
The above is the variance of <math>\begin{align}\displaystyle u \end{align}</math> formed by the weight vector <math>\begin{align}\textbf{w} \end{align}</math>. The first principal component is the vector <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the function. Our goal is to find the weight <math>\begin{align}\textbf{w} \end{align}</math> that maximizes this variability. The quadratic form <math>\textbf{w}^T s \textbf{w}</math> is convex in <math>\begin{align}\textbf{w} \end{align}</math> and grows without bound as the length of <math>\begin{align}\textbf{w} \end{align}</math> increases, so it has no maximum over all <math>\begin{align}\textbf{w} \end{align}</math>. Therefore we need to add a constraint that restricts the length of <math>\begin{align}\textbf{w} \end{align}</math>. Since we are only interested in the direction of the variability, we fix the length to 1 and the problem becomes<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
s.t. <math>\textbf{w}^T \textbf{w} = 1.</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|.<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded on the set of unit vectors, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
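<br />
Both the variance identity and the bound used in this derivation can be checked numerically. The sketch below uses hypothetical random data with one observation per row, and assumes that the norm of the covariance matrix is the spectral norm (Matlab's default for <code>norm</code> of a matrix).<br />
<br />
X = randn(500,3)*[2 0 0; 1 1 0; 0 0.5 0.3]; % hypothetical correlated data<br />
S = cov(X); % sample covariance matrix (s in the text)<br />
w = randn(3,1);<br />
w = w/norm(w); % random direction of unit length<br />
u = X*w; % projection of every observation onto w<br />
[var(u), w'*S*w] % the two values agree up to rounding<br />
[w'*S*w, norm(S)] % the first value never exceeds the second<br />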
<br />
====Lagrange Multiplier====<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point the tangents of the two curves coincide, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to understand which one is the maximum, we just need to substitute each of them in <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
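<br />
For completeness, here is one way to work through the system. Both of the first two conditions determine <math>\displaystyle \lambda</math>, and eliminating it gives <math>\displaystyle y = -x</math>. Substituting this into the constraint and evaluating <math>\displaystyle f</math> at the two solutions gives:<br />
<br />
<math>\begin{align}<br />
x^2 + (-x)^2 = 1 \quad &\Rightarrow \quad x = \pm \frac{\sqrt{2}}{2}, \quad y = \mp \frac{\sqrt{2}}{2} \\<br />
f\left(\frac{\sqrt{2}}{2}, -\frac{\sqrt{2}}{2}\right) &= \sqrt{2} \\<br />
f\left(-\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}\right) &= -\sqrt{2}<br />
\end{align}</math><br />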
<br />
====Determining '''W''' ====<br />
Back to the original problem: writing the sample covariance matrix as <math>S</math> (denoted <math>s</math> above), we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
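If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector then the second term is nonzero. Setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = 0 </math> recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of <math>\displaystyle L(\textbf{w},\lambda)</math> satisfies it.<br />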
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
{{Cleanup|date=October 2010|reason=It is good discussion, what will happen if we don't have distinct eigenvalues and eigenvectors? What does this situation mean? }}<br />
{{Cleanup|date=October 2010|reason=If the eigenvalues are not distinct, I suppose we could still take the leftmost eigenvector by default. Not sure if this is the correct approach, so can anyone please explain further? Thanks }}<br />
{{Cleanup|date=October 2010|reason= As U contains the eigenvectors of a symmetric matrix, is it possible to have 2 similar eigenvectors? }}<br />
<br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
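<br />
The result can be illustrated numerically. The sketch below uses hypothetical random data with one observation per row; <code>eig</code> does not promise any particular ordering of the eigenvalues, so they are sorted explicitly.<br />
<br />
X = randn(500,3)*[2 0 0; 1 1 0; 0 0.5 0.3]; % hypothetical correlated data<br />
S = cov(X); % sample covariance matrix<br />
[W,L] = eig(S); % columns of W are eigenvectors, diag(L) holds the eigenvalues<br />
[lambda,order] = sort(diag(L),'descend');<br />
W = W(:,order); % reorder so that the largest eigenvalue comes first<br />
w1 = W(:,1); % first principal component<br />
[w1'*S*w1, lambda(1)] % the variance along w1 equals the largest eigenvalue<br />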
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
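<br />
A quick numerical illustration of this decomposition, sketched on hypothetical random data with one observation per row. The trace of the sample covariance matrix is the sum of the variances of the individual coordinates.<br />
<br />
X = randn(500,3)*[2 0 0; 1 1 0; 0 0.5 0.3]; % hypothetical correlated data<br />
S = cov(X); % sample covariance matrix<br />
lambda = eig(S); % variances along the principal components<br />
[sum(lambda), trace(S), sum(var(X))] % three equal numbers: the total variance<br />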
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noise data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
m_X=mean(X,2);<br />
mm=repmat(m_X,1,300);<br />
XX=X-mm;<br />
[u s v] = svd(XX);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
xHat=xHat+mm;<br />
figure<br />
imagesc(reshape(xHat(:,1),20,28)') % column 1 matches the noisy image plotted above; any column index from 1 to 300 can be used<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
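<br />
One common heuristic for choosing the number of components, which was not part of the class example, is to keep the smallest number that explains a fixed fraction of the total variance. The sketch below uses hypothetical data and an assumed threshold of 95%; both are illustrative choices only.<br />
<br />
X = randn(20,5)*randn(5,300) + 0.1*randn(20,300); % hypothetical data: 5 strong directions plus noise<br />
Xc = X - repmat(mean(X,2),1,300); % center the data, as in the code above<br />
sv = svd(Xc); % singular values, largest first<br />
explained = cumsum(sv.^2)/sum(sv.^2); % fraction of total variance kept by the first k components<br />
k = find(explained >= 0.95, 1) % smallest k that keeps at least 95% of the variance<br />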
<br />
====Application of PCA - Feature Extraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized as follows (taken from the Lecture Slides).<br />
<br />
====Algorithm ====<br />
'''Recover basis:''' Calculate <math> XX^T =\sum_{i=1}^{n} x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.<br />
<br />
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times n</math> matrix of encoding of the original data.<br />
<br />
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.<br />
<br />
'''Encode test example:''' <math> y=U^T x </math> where <math> y </math> is a <math>d</math>-dimensional encoding of <math>x</math>.<br />
<br />
'''Reconstruct test example:''' <math>\hat{x}= Uy=UU^Tx </math>.<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.<br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
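<br />
The algorithm above can be sketched in Matlab as follows. The data, the sizes and the test example are hypothetical; the steps follow the order of the algorithm, with the mean subtracted first as required by the first note.<br />
<br />
D = 10; n = 100; d = 3; % hypothetical sizes: D original and d reduced dimensions<br />
X = randn(D,n); % hypothetical training data, one column per data point<br />
mu = mean(X,2);<br />
X = X - repmat(mu,1,n); % the data must have zero mean<br />
[E,L] = eig(X*X'); % recover basis from XX^T<br />
[lambda,order] = sort(diag(L),'descend');<br />
U = E(:,order(1:d)); % eigenvectors of the top d eigenvalues<br />
Y = U'*X; % encode training data (d-by-n)<br />
Xhat = U*Y; % reconstruct training data (D-by-n)<br />
x = randn(D,1) - mu; % hypothetical test example, centered with the training mean<br />
y = U'*x; % encode test example<br />
xhat = U*y; % reconstruct test example<br />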
<br />
== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010 ==<br />
<br />
===Sir Ronald A. Fisher===<br />
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here]. <br />
In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
As in PCA, the goal of FDA is to project the data onto a lower-dimensional space. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf here]. PCA is an unsupervised method for dimensionality reduction, so it does not take into account the labels in the data. Suppose we have two clusters that have different labels but are nevertheless elongated in parallel directions and lie very near to each other. In this case, most of the total variation of the data is along the common direction of these two clusters. If we use PCA in cases like this, then both clusters are projected onto the direction of greatest variation of the data and merge into what looks like a single cluster. PCA would therefore mix up two clusters that, in fact, have different labels. What we need to do instead, in cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation, that is, onto the direction of least variation of the data. In the 1-dimensional space resulting from such a projection, we would be able to classify the data effectively, because the two clusters would be perfectly or nearly perfectly separated according to their labels. This is exactly the idea behind FDA. <br />
<br />
The main difference between FDA and PCA is that, in FDA, in contrast to PCA, we are not interested in retaining as much of the variance of our original data as possible. Rather, in FDA, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction that is most representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
Suppose we have 2-dimensional data, then FDA would attempt to project the data of each class onto a point in such a way that the resulting two points would be as far apart from each other as possible. Intuitively, this basic idea behind FDA is the optimal way for separating each pair of classes along a certain direction. <br />
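<br />
The contrast between the two methods can be illustrated on two parallel, elongated clusters like the ones described above. This is only a sketch on hypothetical data. The FDA direction is computed with the standard closed-form solution for two classes (the inverse of the within-class scatter applied to the difference of the class means), which is assumed here rather than derived in the text above.<br />
<br />
n = 200;<br />
A = [randn(n,1)*5, randn(n,1)*0.3 + 1]; % class 1: elongated along x, centered at y = +1<br />
B = [randn(n,1)*5, randn(n,1)*0.3 - 1]; % class 2: elongated along x, centered at y = -1<br />
X = [A; B];<br />
[W,L] = eig(cov(X));<br />
[lmax,i] = max(diag(L));<br />
w_pca = W(:,i); % direction of greatest variance, roughly the x axis<br />
Sw = cov(A) + cov(B); % within-class scatter (up to scaling)<br />
w_fda = Sw \ (mean(A) - mean(B))'; % Fisher direction, roughly the y axis<br />
w_fda = w_fda/norm(w_fda);<br />
gap_pca = abs(mean(A*w_pca) - mean(B*w_pca)) % small compared with the spread along w_pca<br />
gap_fda = abs(mean(A*w_fda) - mean(B*w_fda)) % large compared with the spread along w_fda<br />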
<br />
{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).}}<br />
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}<br />
{{Cleanup|date=October 2010|reason=An extension of clustering is subspace clustering, in which different subspaces are searched to find the relevant and appropriate dimensions. Points in high-dimensional data sets are roughly equidistant from each other, so feature selection methods are used to remove the irrelevant dimensions. These techniques do not preserve relative distances, so PCA is not useful for these applications. It should be noted that subspace clustering methods localize their search, unlike feature selection algorithms. For more information click here[http://portal.acm.org/citation.cfm?id=1007731]}}<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2-class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k-class problem, we want to reduce the data to k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
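<br />
The equivalence between a Mahalanobis-type distance and a linear transformation followed by Euclidean distance can be checked numerically. The following is a minimal sketch in plain Python; the matrix <code>L</code> and the points <code>x</code> and <code>y</code> are arbitrary values chosen for illustration and are not taken from the paper discussed in this section.<br />

```python
# Illustrative check: a Mahalanobis-type distance with A = L^T L equals the
# squared Euclidean distance after mapping both points through L.
# L, x and y are arbitrary example values (assumptions for this sketch).

def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

L = [[2.0, 0.0],
     [1.0, 0.5]]
A = matmul(transpose(L), L)          # positive semi-definite metric matrix

x = [1.0, 3.0]
y = [4.0, -1.0]
diff = [xi - yi for xi, yi in zip(x, y)]

# Squared Mahalanobis-type distance (x - y)^T A (x - y)
d_mahalanobis = sum(d * a for d, a in zip(diff, matvec(A, diff)))

# Squared Euclidean distance between the transformed points Lx and Ly
Lx, Ly = matvec(L, x), matvec(L, y)
d_euclidean = sum((p - q) ** 2 for p, q in zip(Lx, Ly))

print(d_mahalanobis, d_euclidean)
```

Both printed values agree, which is why learning the metric matrix can be viewed as learning the linear map <code>L</code>.<br />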
<br />
{{Cleanup|date=October2010|reason=Anyone please add an example to make the comparison clearer}}<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs as well as or better than some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
===FDA Goals===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
==== Example in R ====<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
 >> library(MASS)<br />
 >> X = matrix(nrow=400,ncol=2)<br />
 >> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
 >> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
 >> Y = c(rep("red",200),rep("blue",200))<br />
: Load the <code>MASS</code> package, which provides <code>mvrnorm</code> and <code>lda</code>. Draw 200 samples from each of two bivariate normal distributions with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, a vector indicating which class each sample belongs to.<br />
<br />
 >> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
: Calculate the singular value decomposition of the column-centred X (PCA requires the mean to be removed first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
<br />
FDA projects the data into lower dimensional space, where the distances between the projected means are maximum and the within-class variances are minimum. There are two categories of classification problems:<br />
<br />
1. Two-class problem<br />
<br />
2. Multi-class problem (addressed next lecture)<br />
<br />
=== Two-class problem ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the variance within each class''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, <math>\underline{w}^T(\Sigma_{1}+\Sigma_{2})\underline{w}</math> (the sum of two covariance matrices is itself a valid covariance matrix, being symmetric and positive semi-definite). <br />
<br />
{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}<br />
<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline{x_1}, \underline{x_2}, \ldots, \underline{x_n} \} </math><br />
<br />
<br />
Projected points are <math>z_{i} \in \mathbb{R}^{1}</math> with <math>z_{i} = \underline{w}^T \underline{x_{i}}</math>, where each <math>\ z_i </math> is a scalar<br />
<br />
====1. Minimizing within-class variance==== <br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_1\underline{w}) </math><br />
<br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_2\underline{w}) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math><br />
<br> (where <math>\,\Sigma_1</math> and <math>\,\Sigma_2 </math> are the covariance matrices of the 1st and 2nd classes of data, respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>.<br />
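<br />
The claim that the variance of the projected data equals the quadratic form in the class covariance can be verified on a toy example. The sketch below uses plain Python; the four data points and the direction <code>w</code> are made-up values, and the covariance uses a 1/n normalization (an assumption, since the lecture does not fix one).<br />

```python
# Made-up data for a single class, one 2-D point per row (illustrative only).
X = [[1.0, 2.0], [2.0, 4.5], [3.0, 5.0], [4.0, 8.5]]
w = [0.6, -0.8]                      # arbitrary projection direction
n = len(X)

mean = [sum(col) / n for col in zip(*X)]

# Class covariance matrix with 1/n normalisation
Sigma = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in X) / n
          for b in range(2)] for a in range(2)]

# Variance of the projected scalars z_i = w^T x_i
z = [w[0] * x[0] + w[1] * x[1] for x in X]
z_mean = sum(z) / n
var_z = sum((zi - z_mean) ** 2 for zi in z) / n

# Quadratic form w^T Sigma w
quad = sum(w[a] * Sigma[a][b] * w[b] for a in range(2) for b in range(2))

print(var_z, quad)
```

The two printed numbers coincide, so minimizing the projected within-class variance is the same as minimizing the quadratic form in the covariance.<br />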
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br /> <br />
<math>\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2, </math><br />
<br /><br /><br />
<math>\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1 - \mu_2)^T \underline{w} \underline{w}^T (\mu_1 - \mu_2) \\<br />
<br />
&= ((\mu_1 - \mu_2)^T \underline{w})^{T} (\underline{w}^T (\mu_1 - \mu_2))^{T} \\<br />
&= \underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \underline{w} \end{align}</math><br /><br />
<br />
Note that in the last two lines each bracketed factor is a scalar and therefore equals its own transpose, which is what allows the factors to be transposed and reordered.<br />
<br />
Let <math>\displaystyle s_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> be the between-class covariance. The goal is then to <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>.<br />
<br />
===The Objective Function for FDA===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math> or <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math><br />
# <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math> <br />
<br /><br /><br />
So, we construct our objective function as maximizing the ratio of the two goals stated above:<br /><br />
<br /><br />
<math>\displaystyle \max_w \frac{(\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w})} {(\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max_w \frac{(\underline{w}^Ts_B\underline{w})}{(\underline{w}^Ts_w\underline{w})}</math> <br /><br />
One could instead combine the two goals by subtraction, but it can be shown that this requires an additional scaling factor to balance the two terms. Using the ratio avoids that extra parameter.<br />
<br />
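This ratio is unchanged when <math>\underline{w}</math> is rescaled, so its maximizer is not unique; and if we maximized the numerator alone, which is a convex quadratic in <math>\underline{w}</math>, it would be unbounded. To get around this problem, we fix the scale of <math>\underline{w}</math> by adding the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> and maximize the numerator. To solve this constrained optimization problem we form the Lagrangian:<br />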
<br />
<br /><br /><br />
<math>\displaystyle L(\underline{w},\lambda) = \underline{w}^Ts_B\underline{w} - \lambda (\underline{w}^Ts_w\underline{w} -1)</math><br /><br /><br />
<br />
<br /><br />
Then, we set the partial derivative of L with respect to <math>\underline{w}</math> to zero:<br />
<math>\displaystyle \frac{\partial L}{\partial \underline{w}}=2s_B \underline{w} - 2\lambda s_w \underline{w} = 0 </math> <br /><br />
<br />
<math>s_B \underline{w} = \lambda s_w \underline{w}</math><br /><br />
<math>s_w^{-1}s_B \underline{w}= \lambda\underline{w}</math><br /><br /><br />
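This is a generalized eigenvalue problem. Multiplying <math>s_B \underline{w} = \lambda s_w \underline{w}</math> on the left by <math>\underline{w}^T</math> and using the constraint gives <math>\underline{w}^Ts_B\underline{w} = \lambda</math>, so the objective is maximized when <math> \underline{w}</math> is the eigenvector of <math>s_w^{-1}s_B </math> corresponding to the largest eigenvalue.<br /><br />
<br />
This solution can be further simplified as follows:<br /><br />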
<br />
<math>s_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w} = \lambda\underline{w} </math><br /><br />
<br />
Since <math>(\mu_1 - \mu_2)^T\underline{w}</math> is a scalar, it follows that <math>\underline{w} \propto s_w^{-1}(\mu_1 - \mu_2)</math>. <br /><br /><br />
This gives the direction of <math>\underline{w}</math> without doing eigenvalue decomposition in the case of 2-class problem.<br />
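<br />
The closed-form direction can be evaluated by hand for a small example. The sketch below, in plain Python, assumes the population parameters used in the R and Matlab examples of this section (means (1,1) and (5,3), common covariance [1 1.5; 1.5 3]); the helper names <code>inv2</code> and <code>fisher_ratio</code> are hypothetical and introduced only for illustration. It also scans unit directions on a half-degree grid to confirm that none of them beats the closed-form direction.<br />

```python
import math

# Population parameters assumed from the examples in this section.
mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
Sigma = [[1.0, 1.5], [1.5, 3.0]]
s_w = [[2 * v for v in row] for row in Sigma]    # Sigma_1 + Sigma_2
d = [a - b for a, b in zip(mu1, mu2)]            # mu_1 - mu_2

def inv2(M):
    """Inverse of a 2 x 2 matrix (hypothetical helper)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def fisher_ratio(w):
    """(w^T s_B w) / (w^T s_w w) with s_B = d d^T."""
    num = (w[0] * d[0] + w[1] * d[1]) ** 2
    den = sum(w[a] * s_w[a][b] * w[b] for a in range(2) for b in range(2))
    return num / den

s_w_inv = inv2(s_w)
# Closed-form FDA direction: w proportional to s_w^{-1} (mu_1 - mu_2)
w_fda = [s_w_inv[0][0] * d[0] + s_w_inv[0][1] * d[1],
         s_w_inv[1][0] * d[0] + s_w_inv[1][1] * d[1]]

# Brute-force scan over unit directions at half-degree steps
best_scan = max(fisher_ratio([math.cos(t), math.sin(t)])
                for t in (k * math.pi / 360 for k in range(360)))

print(w_fda, fisher_ratio(w_fda), best_scan)
```

Up to sign, the resulting direction is the same vector that the Matlab example in the next subsection computes with <code>sw\[4; 2]</code>; only the direction matters, not the sign or length.<br />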
<br />
Note: In order for <math>{s_w}</math> to have an inverse, it must have full rank. A necessary (though not sufficient) condition for this is that the number of data points <math>\,\ge</math> the dimensionality of <math>\underline{x_{i}}</math>.<br />
<br />
===FDA Using Matlab===<br />
Note: ''The following example was not actually mentioned in this lecture''<br />
<br />
We now apply the theory that we just introduced. Using Matlab, we find the principal components and the Fisher Discriminant Analysis projection of samples from two bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data set:<br />
% First data set X1<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);<br />
%In this case: <br />
mu_1=[1;1]; <br />
Sigma_1=[1 1.5; 1.5 3]; <br />
%where mu and sigma are the mean and covariance matrix.<br />
% Second data set X2<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300); <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
plot(X1(:,1),X1(:,2),'.b'); hold on;<br />
plot(X2(:,1),X2(:,2),'ob')<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
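 %We compute the principal components:<br />
 % Combine data sets to map both into the same subspace<br />
 X=[X1;X2];<br />
 % We used built-in PCA function in Matlab (princomp expects one observation per row)<br />
 [coefs, scores]=princomp(X);<br />
 % Transpose so that each column is one data point, as the projections below require<br />
 X=X';<br />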
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is very little overlap<br />
 Xp=coefs(:,1)'*X<br />
 figure<br />
 plot(Xp(1:300),1,'ob')<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case the two classes overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
===Some of FDA applications===<br />
FDA has applications in many domains; some of them are listed below:<br />
<br />
* Speech/Music/Noise Classification in Hearing Aids<br />
FDA can be used to enhance listening comprehension when the user moves from one sound<br />
environment to a different one. For more information, see the paper by Alexandre et al. [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here].<br />
<br />
* Application to Face Recognition<br />
FDA can be used for face recognition in different situations. Kong et al. use FDA in a framework for face<br />
recognition with a small number of training samples [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].<br />
<br />
* Palmprint Recognition<br />
FDA is used in biometrics to implement an automated palmprint recognition system. See "An Automated Palmprint Recognition System" by Tee et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here].<br />
<br />
{{Cleanup|date=October 2010|reason=I think briefing about the other applications would be easier than browsing through all of these applications}}<br />
<br />
{{Cleanup|date=October 2010|reason= This link is no longer valid.}}<br />
<br />
Other applications can be found in references 4, 5, 6, 7 and 8, and more [http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=1489148820&_sort=r&_st=13&view=c&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=f210273546a659c90ae0962fce7b8b4e&searchtype=a here]<br />
<br />
=== '''References'''===<br />
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, pp. 1083-1088, 20-25 June 2005<br />
doi: 10.1109/CVPR.2005.30<br />
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]<br />
<br />
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS USING A TWO-LAYER CLASSIFICATION SYSTEM WITH MSE LINEAR DISCRIMINANTS", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]<br />
<br />
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]<br />
<br />
4. met, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.<br />
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]<br />
<br />
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"<br />
Journal of Computers & Chemical Engineering, 2004<br />
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]<br />
<br />
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004<br />
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]<br />
<br />
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]<br />
<br />
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal of Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]<br />
<br />
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010==<br />
<br />
====Obtaining Covariance Matrices====<br />
<br />
<br />
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>i</math> (its covariance without the <math>1/n_{i}</math> normalization, which is the convention used in the derivation below) and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between-class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total covariance <math>\,\mathbf{S}_{T}</math> of a given set of data is constant and known, and we can also decompose this variance into two parts: the within-class variance <math>\mathbf{S}_{W}</math> and the between-class variance <math>\mathbf{S}_{B}</math> in a way that is similar to [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA]. We thus have:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
where the total variance (likewise without the <math>1/n</math> normalization) is given by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = <br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
We can now get <math>\mathbf{S}_{B}</math> from the relationship: <br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
<br />
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Suppose the data contains <math>\, k </math> classes, and each class <math>\, j </math> contains <math>\, n_{j} </math> data points. We denote the overall mean vector by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
As before, the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Writing each deviation as <math>(\mathbf{x}_{j} - \mathbf{\mu}_{i}) + (\mathbf{\mu}_{i} - \mathbf{\mu})</math> and noting that the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i}) = 0</math>, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within-class covariance <math>\mathbf{S}_{W}</math><br />
and the between-class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term in the final line of the derivation above as the between-class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
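<br />
The decomposition of the total scatter into within-class and between-class parts can be checked numerically. The following sketch uses plain Python on made-up 2-dimensional data with three classes, and it uses sums of outer products without any 1/n normalization, as in the derivation above; the helper names are hypothetical.<br />

```python
# Made-up 2-D data with three classes (values are illustrative only).
data = {
    1: [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]],
    2: [[6.0, 5.0], [7.0, 7.0]],
    3: [[0.0, 8.0], [1.0, 9.0], [2.0, 7.0], [1.0, 8.0]],
}

def mean(points):
    return [sum(c) / len(points) for c in zip(*points)]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scatter(points, centre):
    """Sum of outer products of the deviations from `centre`."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        dev = [pi - ci for pi, ci in zip(p, centre)]
        S = add(S, outer(dev, dev))
    return S

all_points = [p for pts in data.values() for p in pts]
mu = mean(all_points)                     # overall mean

S_T = scatter(all_points, mu)             # total scatter
S_W = [[0.0, 0.0], [0.0, 0.0]]            # within-class scatter
S_B = [[0.0, 0.0], [0.0, 0.0]]            # between-class scatter
for pts in data.values():
    mu_i = mean(pts)
    S_W = add(S_W, scatter(pts, mu_i))
    dev = [a - b for a, b in zip(mu_i, mu)]
    S_B = add(S_B, [[len(pts) * v for v in row] for row in outer(dev, dev)])

S_sum = add(S_W, S_B)
print(S_T)
print(S_sum)
```

The two printed matrices agree up to rounding, so the between-class matrix can be obtained either directly from the class means or as the difference of the total and within-class matrices.<br />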
<br />
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^* =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})^{T})<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
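The two are in fact proportional. Since <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, the multi-class definition with <math>k=2</math> gives <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B}^*</math>, so both lead to the same discriminant direction.<br />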
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
The within-class covariance of the projected data is then<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))((\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W})<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, the between-class covariance of the projected data is<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the following as our measure:<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of the problem<br />
<br />
{{Cleanup|date=What if we encounter complex eigenvalues? Then the concept of being large does not make sense. What is the solution in that case? }}<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
<br />
Recall that the squared Frobenius norm of <math>\mathbf{X}</math> is <br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
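<br />
The trace identity derived above can be confirmed on a small example. The sketch below is plain Python; the class means, class sizes and the projection matrix <code>W</code> are made-up values used only for illustration.<br />

```python
# Illustrative values: class means, class sizes and W are made up.
means = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0], [0.0, 4.0, 1.0]]   # k = 3, d = 3
sizes = [2, 3, 5]
W = [[1.0, 0.0], [0.5, 1.0], [-1.0, 2.0]]                      # d x (k-1)

n = sum(sizes)
# Overall mean as the size-weighted average of the class means
mu = [sum(n_i * m[a] for n_i, m in zip(sizes, means)) / n for a in range(3)]

def project(v):
    """Return W^T v."""
    return [sum(W[a][c] * v[a] for a in range(3)) for c in range(2)]

# Left-hand side: weighted squared norms of the projected mean deviations
lhs = 0.0
for n_i, m in zip(sizes, means):
    dev = project([m[a] - mu[a] for a in range(3)])
    lhs += n_i * sum(v * v for v in dev)

# Right-hand side: Tr[W^T S_B W]
S_B = [[sum(n_i * (m[a] - mu[a]) * (m[b] - mu[b])
            for n_i, m in zip(sizes, means))
        for b in range(3)] for a in range(3)]
rhs = sum(W[a][c] * S_B[a][b] * W[b][c]
          for a in range(3) for b in range(3) for c in range(2))

print(lhs, rhs)
```

Both sides give the same number, which is what allows the separation measure to be written compactly as a trace.<br />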
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following classic criterion function that Fisher used<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we fix the scale of each column of <math>\mathbf{W}</math> and maximize the numerator:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>\mathbf{w}_{i}^{T}\mathbf{S}_{W}\mathbf{w}_{i}=1, \ i=1,\ldots,k-1</math><br />
<br />
To solve this optimization problem we introduce one Lagrange multiplier <math>\lambda_{i}</math> per constraint and collect them in a <math>(k-1) \times (k-1)</math> diagonal matrix <math>\Lambda</math>:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - \mathbf{I} \right)\right]<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \mathbf{S}_{W}\mathbf{W}\Lambda=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \mathbf{S}_{W}\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero diagonal entries (eigenvalues), because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>, that is, they satisfy<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
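<br />
The recipe above can be sketched in MATLAB as follows. This is only a minimal illustration under assumptions that are not part of the lecture: the data matrix X is <math>d \times n</math> with one observation per column, labels holds class indices <math>1,\dots,k</math>, <math>\mathbf{S}_{W}</math> is the sum of the within-class scatter matrices and is invertible, and the toy data and all variable names are hypothetical.<br />
 >>X = [randn(3,50), randn(3,50)+2, randn(3,50)-2];    % toy data: d = 3, n = 150<br />
 >>labels = [ones(50,1); 2*ones(50,1); 3*ones(50,1)];  % k = 3 classes<br />
 >>[d, n] = size(X); k = max(labels);<br />
 >>mu = mean(X, 2);                                     % overall mean (d x 1)<br />
 >>SB = zeros(d); SW = zeros(d);<br />
 >>for i = 1:k<br />
 >>    Xi = X(:, labels == i);                          % observations of class i<br />
 >>    ni = size(Xi, 2);<br />
 >>    mui = mean(Xi, 2);                               % mean of class i<br />
 >>    SB = SB + ni * (mui - mu) * (mui - mu)';         % between-class scatter<br />
 >>    Xc = Xi - repmat(mui, 1, ni);                    % centre class i<br />
 >>    SW = SW + Xc * Xc';                              % within-class scatter<br />
 >>end<br />
 >>[V, D] = eig(SW \ SB);                               % eigenvectors of inv(SW)*SB<br />
 >>[~, order] = sort(real(diag(D)), 'descend');         % largest eigenvalues first<br />
 >>W = real(V(:, order(1:k-1)));                        % d x (k-1) transformation matrix<br />
 >>Y = W' * X;                                          % projected data, (k-1) x n<br />
:Taking the real part guards against tiny imaginary components that eig can return for a non-symmetric matrix; the remaining <math>d-(k-1)</math> eigenvalues are numerically zero.<br />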
<br />
{{Cleanup|date=October 2010|reason=Adding more general comments about the advantages and flaws of FDA would be effective here.}}<br />
<br />
{{Cleanup|date=October 2010|reason=Could you please show how the original data can be reconstructed from the data whose dimensionality has been reduced by FDA?}}<br />
{{Cleanup|date=October 2010|reason= When you reduce the dimensionality of data, in general you lose some features of the data and cannot reconstruct the data from the reduced space unless the data have special properties that help in reconstruction, such as sparsity. In FDA it seems that, in general, we cannot reconstruct the data from the reduced version. }}<br />
<br />
===Generalization of Fisher's Linear Discriminant Analysis ===<br />
<br />
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity<br />
and the fact that it does not require strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be badly degraded by only a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Therefore, there is a need for robust procedures that can accommodate outliers and are not strongly affected by them. A generalization of Fisher's linear discriminant algorithm [http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf] has been developed that leads easily to a very robust procedure.<br />
<br />
Also notice that LDA can be seen as a dimensionality reduction technique. In a general k-class problem, the k class means lie in an affine subspace of dimension at most k-1. Given a data point, we are looking for the class mean closest to this point. In LDA, we project the data point onto this subspace and calculate distances within it. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimensionality from d dimensions to k - 1 dimensions.<br />
<br />
==Linear and Logistic Regression - October 12, 2010==<br />
<br />
===Linear Regression===<br />
Linear regression is an approach for modeling a scalar response <math>\, y</math> as a linear function of a set of explanatory (independent) variables <math>\,X</math>. In regression the goal is to find an appropriate set of explanatory variables for <math>\, y</math> and to estimate its value from them. In classification, by contrast, the goal is to assign data to different groups such that members of the same group are more similar to one another than to members of other groups.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
Recall that the Bayes classifier is based on the posterior<br/><br />
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{l}f_{l}(x)\pi_{l}}</math><br/><br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the <math>\,d</math> components of the<br />
input <math>\,\mathbf{x}</math>.<br />
<br />
The simple linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
y_i = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
and we can denote it as<br />
:<math><br />
\begin{align}<br />
\mathbf{y} = \beta^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
where <math>\,\beta^{T} = (<br />
\beta_1,..., \beta_{d},\beta_0)</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=<br />
\begin{pmatrix}<br />
\mathbf{x}_{1}, \dots,\mathbf{x}_{n}\\<br />
1, \dots, 1<br />
\end{pmatrix}<br />
</math> is a <math>(d+1) \times n</math> matrix, where each <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector and <math>\mathbf{y} = (y_1, \dots, y_n)</math> is a <math>1 \times n</math> row vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find the <math>\,\beta^{T}</math> for which the linear model best fits the data, in the sense of minimizing the sum of squared errors, using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
We then try to minimize the residual sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\beta^{T}\mathbf{X})(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}(\mathbf{y}^{T}-\mathbf{X}^{T}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}^{T}<br />
\end{align}<br />
</math><br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \hat\beta^{T}\mathbf{X} = <br />
\mathbf{y}\mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X} </math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].<br />
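<br />
As a small numerical check of these formulas, the following MATLAB sketch uses the conventions of this section (X is <math>(d+1) \times n</math> with a last row of ones, y is <math>1 \times n</math>). The five data points are hypothetical and were chosen to lie exactly on the line <math>y = 2x + 1</math>, so the estimate should recover <math>\beta_1 = 2</math> and <math>\beta_0 = 1</math> up to rounding error.<br />
 >>x = [1 2 3 4 5];                 % n = 5 inputs, d = 1<br />
 >>y = [3 5 7 9 11];                % targets, exactly y = 2x + 1<br />
 >>X = [x; ones(1, 5)];             % (d+1) x n, last row of ones for beta_0<br />
 >>beta = (X * X') \ (X * y');      % solves (X X^T) beta = X y^T<br />
 >>H = X' * ((X * X') \ X);         % n x n hat matrix<br />
 >>yhat = y * H;                    % fitted values, equal to beta' * X<br />
:The backslash operator solves the linear system directly, which is generally preferable to forming <math>(\mathbf{X}\mathbf{X}^{T})^{-1}</math> explicitly.<br />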
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{l}f_{l}(x)\pi_{l}}</math><br/><br />
To make sense as a probability, <math>\displaystyle r(x)</math> must take a value between 0 and 1. In the two-class case with <math>\ Y \in \{0,1\}</math>, the regression function equals the posterior:<br />
<math>\ E(Y|X=x) = 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x) = P(Y=1|X=x) </math>.<br />
Estimating <math>\displaystyle r(x)</math> by the regression function <math>\displaystyle E(Y|X=x)</math> is therefore a more direct approach to classification, since it does not need estimates of <math>\ f_k(x) </math> and <math>\ \pi_k </math>. However, if <math>\displaystyle r(x)</math> is modeled as a linear function and <math>\mathbf{\hat\beta} </math> is learned as above, nothing restricts <math>\displaystyle r(x)</math> to values between 0 and 1.<br />
The linear model is therefore not a proper model of the posterior, although at times it can still lead to a decent classifier. One coding of the two classes used in this setting is <math>\ y_i=\frac{1}{n_1} </math> for the first class and <math>\ y_i=\frac{-1}{n_2} </math> for the second, where <math>\ n_1</math> and <math>\ n_2</math> are the class sizes.<br />
<br />
===Logistic Regression===<br />
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood <math>\displaystyle Pr(Y|X)</math>. Since <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications with two classes, for instance: patients survive or die, have a disease or not, have a risk factor or not.<br />
<br />
====Logistic Function====<br />
[[File:200px-Logistic-curve.svg.png | Logistic Sigmoid Function]]<br />
<br />
<br />
<br />
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common sigmoid curve. <br />
<br />
1. <math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
3. <math>y(0) = \frac{1}{2}</math><br />
<br />
4. <math> \int y \, dx = \ln(1 + e^{x}) + C</math><br />
<br />
5. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
The logistic curve shows early exponential growth for negative x, which slows to linear growth of slope 1/4 near x = 0, then approaches y = 1 with an exponentially decaying gap.<br />
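<br />
Property 2 follows directly from property 1. As a short worked step (added here for illustration), differentiating <math>y = (1+e^{-x})^{-1}</math> gives<br />
:<math>\frac{dy}{dx} = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = y(1-y)</math><br />
At <math>x = 0</math> this evaluates to <math>\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}</math>, which is the slope of the curve at the origin.<br />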
<br />
====Intuition behind Logistic Regression====<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
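<br />
The step from the log odds to the posterior can be made explicit. Write <math>\,p = P(Y=1|X=x)</math>, so that <math>\,P(Y=0|X=x) = 1-p</math> (the shorthand <math>\,p</math> is introduced only for this derivation). Exponentiating both sides of the log odds model gives<br />
:<math>\frac{p}{1-p} = \exp(\beta^Tx) \;\Rightarrow\; p = (1-p)\exp(\beta^Tx) \;\Rightarrow\; p = \frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}</math><br />
Since <math>\,\exp(\beta^Tx) > 0</math>, the value of <math>\,p</math> lies strictly between 0 and 1 for every <math>\,x</math>.<br />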
<br />
====The Logistic Regression Model====<br />
<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
{{Cleanup|date=October 18 2010|reason=I could not find any source for these graphs. However, they follow the definition of the probability given here. I don't think the generated graph as it is here is copyrighted, but if you are worried you can draw this figure by applying the function and post the result.}}<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
====Fitting a Logistic Regression====<br />
Logistic regression tries to fit a distribution. The common practice in statistics is to fit a density function, here the posterior density of each class <math>\displaystyle Pr(Y|X)</math>, to the data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the observed labels <math>\displaystyle{y_{1},...,y_{n}}</math> given the inputs <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence and identical distribution)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the logarithm of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n \left[ y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right) \right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[ y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right) \right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[ y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i})) \right]\\<br />
&=\displaystyle\sum_{i=1}^n \left[ y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i})) \right]<br />
\end{align}<br />
</math><br />
<br />
<br />
{{Cleanup|date=October 13 2010|reason=I think, in the following, y_i * x_i and the single x_i on the right side should both be transposed by matrix calculus?}}<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math> <br />
<br />
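Setting this derivative to zero gives <math>\ d+1 </math> equations that are nonlinear in <math> \beta </math>. Because each <math>\underline{x}_i</math> contains a constant component equal to 1 (for the intercept), the corresponding equation reads <math>\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta}) </math>, i.e. the expected number of observations in class 1 matches the observed number.<br />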
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
====Extension====<br />
<br />
* When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model].<br />
<br />
* Limitations of Logistic Regression:<br />
:1. No assumptions are made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another because this could cause problems with estimation.<br />
:2. A large number of data points (i.e. a large sample size) is required for logistic regression to provide sufficient numbers in both classes. The more features/dimensions the data have, the larger the sample size required.<br />
<br />
==Lecture summary==<br />
{{Cleanup|date=October 18 2010|reason=Can anybody provide a better lecture summary? The one below is to just get it started}}<br />
In this lecture, linear regression was introduced and the posterior probability model for the two-class problem (logistic regression) was defined. Maximum likelihood was used to estimate the parameters of the distribution, i.e. to fit the logistic model to the data.<br />
<br />
== Logistic Regression Cont. - October 14, 2010 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Estimating Parameters <math>\underline{\beta}</math> ===<br />
<br />
'''Criterion''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
'''Newton-Raphson Algorithm:'''<br /><br />
<br />
If we want to find <math>\ x^* </math> such that <math>\ f(x^*)=0</math>,<br />
<br />
we first pick a starting point <math>\ x^{old}</math> and then compute:<br />
<br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f(x^{old})}{f'(x^{old})} </math> <br /><br />
<math> \ x^{old} \leftarrow x^{new}</math> <br />
<br /><br />
This is repeated until convergence. <br />
<br />
If we want to maximize or minimize <math>\ f(x) </math>, we instead solve <math>\ f'(x)=0 </math>, which gives the update<br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f'(x^{old})}{f''(x^{old})} </math><br />
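<br />
The root-finding iteration above can be sketched in a few lines of MATLAB. The function <math>f(x) = x^2 - 2</math>, the starting point, the tolerance and the variable names are all hypothetical choices made for illustration; the positive root of this function is <math>\sqrt{2}</math>.<br />
 >>f  = @(x) x.^2 - 2;                          % example function (hypothetical)<br />
 >>df = @(x) 2*x;                               % its first derivative<br />
 >>x_old = 1;                                   % starting point<br />
 >>for iter = 1:50<br />
 >>    x_new = x_old - f(x_old) / df(x_old);    % Newton-Raphson update<br />
 >>    if abs(x_new - x_old) < 1e-10, break; end   % stop when the step is tiny<br />
 >>    x_old = x_new;<br />
 >>end<br />
:After the loop, x_new holds the approximate root. For optimization, the same loop is applied with the first derivative in place of f and the second derivative in place of df.<br />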
<br />
<br /><br />
<br />
In vector notation the above can be written as <br /><br />
<br />
<math><br />
X^{new} \leftarrow X^{old} - H^{-1}\nabla<br />
</math><br />
<br /><br />
where H is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix] (the matrix of second derivatives) and <math>\nabla</math> is the gradient, both evaluated at <math>X^{old}</math>. <br />
<br /><br />
<br />
'''note:''' If the Hessian is not invertible, the [http://en.wikipedia.org/wiki/Generalized_inverse generalized inverse] or pseudo inverse can be used.<br />
<br /><br />
<br /><br />
<br />
<br />
As shown above, the [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second derivative, or Hessian.<br />
<br />
<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here]. That website is very useful and includes a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T \frac{\exp(\underline{\beta}^T\underline{x}_i)}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained if we first reduce the occurrences of <math>\underline{\beta}</math> to one by using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math> with <math>a=\exp(\underline{\beta}^T \underline{x}_i)</math><br />
<br />
and then evaluating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{(d+1)}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i,i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^{T}\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Here <math>\ Z </math> is called the adjusted response. The equation is solved repeatedly, because <math>\ \underline{P} </math>, <math>\ W </math>, and <math>\ Z </math> all change whenever <math>\underline{\beta}</math> is updated. This algorithm is called [http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares iteratively reweighted least squares] because it solves a weighted least squares problem at each iteration.<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-\underline{\beta}^T X)(\underline{y}-\underline{\beta}^TX)^T</math><br />
<br />
whose solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}^{T}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow arg \min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
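<br />
The steps above can be sketched in MATLAB as follows. This is only an illustrative sketch, not the course's reference implementation: the six data points, the iteration cap of 25 and the tolerance are hypothetical, the input matrix follows the <math>(d+1) \times n</math> convention with a last row of ones, and the two classes deliberately overlap so that the maximum likelihood estimate exists and no entry of <math>\,\underline{P}</math> reaches 0 or 1 (otherwise <math>\,W</math> would not be invertible).<br />
 >>x = [-2 -1 0 1 2 3];                      % toy inputs, d = 1<br />
 >>Y = [0 0 1 0 1 1]';                       % labels, n x 1 (classes overlap)<br />
 >>X = [x; ones(1, 6)];                      % (d+1) x n, last row of ones<br />
 >>beta = zeros(2, 1);                       % step 1: start from zero<br />
 >>for iter = 1:25<br />
 >>    eta = X' * beta;                      % n x 1 linear predictor<br />
 >>    P = exp(eta) ./ (1 + exp(eta));       % step 3: fitted probabilities<br />
 >>    W = diag(P .* (1 - P));               % step 4: diagonal weight matrix<br />
 >>    Z = eta + W \ (Y - P);                % step 5: adjusted response<br />
 >>    beta_new = (X * W * X') \ (X * W * Z);   % step 6: weighted least squares<br />
 >>    if norm(beta_new - beta) < 1e-8, beta = beta_new; break; end   % step 7<br />
 >>    beta = beta_new;<br />
 >>end<br />
:At convergence the gradient <math>X(\underline{Y}-\underline{P})</math> is numerically zero, which is a convenient way to check the result.<br />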
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>, and it is not guaranteed to fall between 0 and 1 or to sum to 1. A closed-form solution exists for least squares.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1. No closed-form solution exists.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math> (plus one if an intercept is included). The number of parameters grows linearly with the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically with the dimension.<br />
#When its distributional assumptions hold, LDA estimates the parameters more efficiently because it uses more information about the data. Samples without class labels can also be used in LDA. <br />
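<br />
As a rough illustration of these growth rates, the two counts stated above can be evaluated for two hypothetical dimensions (ignoring the intercept in logistic regression):<br />
:<math>d=10:\quad d = 10 \quad \text{versus} \quad \frac{d^2+5d+4}{2}=\frac{100+50+4}{2}=77</math><br />
:<math>d=100:\quad d = 100 \quad \text{versus} \quad \frac{d^2+5d+4}{2}=\frac{10000+500+4}{2}=5252</math><br />
A tenfold increase in dimension multiplies the logistic regression count by 10 but the LDA count by roughly 68.<br />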
<br />
{{Cleanup|date=October 2010|reason= Could somebody please validate the following points}} <br />
{{Cleanup|date=October 2010|reason= I'm not too sure about the first point either, but it seems reasonable to me. Would be great if someone can confirm this point. Thanks}} <br />
<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust. (For high-dimensional data, logistic regression is more accommodating.)<br />
#In practice, logistic regression and LDA often give similar results.<br />
#Logistic regression is more robust because it does not assume that each independent variable is normally distributed.<br />
<br />
Many other advantages of logistic regression are explained [http://www.statgun.com/tutorials/logistic-regression.html here].<br />
<br />
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to fit a logistic regression model that classifies the data. This function returns B, a <math>\,(d+1)</math><math>\,\times</math><math>\,(k-1)</math> matrix of estimates, where each column contains the estimated intercept term followed by the predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{\exp(0.1861-5.5917X_1-3.0547X_2)}{1+\exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
:<math>P(Y=2 | X=x)=\frac{1}{1+\exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
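<br />
:As a quick worked check of this rule, consider two evaluation points (both hypothetical and chosen only for illustration):<br />
:<math>X_1=0,\ X_2=0:\quad 0.1861 \geq 0,\quad P(Y=1 | X=x)=\frac{\exp(0.1861)}{1+\exp(0.1861)}\approx 0.546,\quad \hat Y = 1</math><br />
:<math>X_1=1,\ X_2=0:\quad 0.1861-5.5917=-5.4056<0,\quad P(Y=1 | X=x)=\frac{\exp(-5.4056)}{1+\exp(-5.4056)}\approx 0.0045,\quad \hat Y = 2</math><br />
:In both cases the sign of the linear term agrees with comparing <math>P(Y=1 | X=x)</math> to 0.5.<br />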
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is the decision boundary found by logistic regression. The line shows how the two classes are split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in MATLAB.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
===Lecture Summary===<br />
<br />
Traditionally, logistic regression parameters are estimated using maximum likelihood. However, other optimization techniques may be used as well.<br />
<br /><br />
Since there is no closed-form solution for the zero of the first derivative of the log-likelihood, the Newton-Raphson algorithm is used. Since the problem is convex, any optimum to which Newton's method converges is a global optimum.<br />
<br /><br />
Logistic regression requires fewer parameters than LDA or QDA and is therefore more favorable for high-dimensional data.<br />
<br />
===Supplements===<br />
<br />
A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture.<br />
<br />
== '''Multi-Class Logistic Regression & Perceptron - October 19, 2010''' ==<br />
<br />
=== Lecture Summary ===<br />
<br />
In this lecture, the topic of logistic regression was concluded by covering multi-class logistic regression, and a new topic, the perceptron, was introduced. The perceptron is a linear classifier for two-class problems. Its main goal is to classify data into two classes by minimizing the distances between the misclassified points and the decision boundary. This will be continued in the following lectures.<br />
<br />
=== Multi-Class Logistic Regression ===<br />
Recall that in two-class logistic regression, the posterior probability of one of the classes (say class 1) is modeled by a function of the form shown in Figure 1, while the posterior probability of the second class is its complement. <br /><br /><br />
<math>\displaystyle P(Y=0 | X=x) = 1 - P(Y=1 | X=x)</math><br /><br />
<br />
This function is called the sigmoid (logistic) function, which is why this algorithm is called "logistic regression".<br />
[[File:Picture1.png|150px|thumb|right|<math>Fig.1: P(Y=1 | X=x)</math>]]<br />
<br />
<math>\displaystyle \sigma\,\!(a) = \frac {e^a}{1+e^a} = \frac {1}{1+e^{-a}}</math><br /><br /><br />
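<br />
The two forms of the logistic function are algebraically equal, but on a computer each one overflows for a different sign of the argument. A minimal sketch, assuming plain floating-point arithmetic, picks whichever form is safe; the function name <code>sigmoid</code> and the example argument 0.7 are illustrative choices only.<br />
<br />
```python
import math

def sigmoid(a):
    """Logistic function, using whichever of the two equivalent forms avoids overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))   # 1 / (1 + e^{-a})
    e = math.exp(a)                          # e^{a} is at most 1 here
    return e / (1.0 + e)                     # e^{a} / (1 + e^{a})

p1 = sigmoid(0.7)   # posterior of class 1 for an argument a = 0.7
p0 = 1.0 - p1       # posterior of class 0 is the complement
print(p1, p0)
```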
<br />
In the two-class logistic regression, we basically compare the posterior of one class to the other one using this ratio:<br /><br />
<br />
:<math> \frac{P(Y=1|X=x)}{P(Y=0|X=x)}</math><br /><br />
<br />
If we look at the natural logarithm of this ratio, we find that it is always a linear function in <math>x</math>:<br /><br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\underline{\beta}^T\underline{x} \quad \rightarrow (*)</math> <br /><br /><br />
<br />
What if we have more than two classes?<br /><br />
<br />
Using (*), we can extend the notion of logistic regression for the cases where we have more than two classes.<br /><br />
<br />
Assume we have <math>k</math> classes. Looking at the logarithm of the ratio of posteriors of each class and the k<sup>th</sup> class, we have: <br /><br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_1}^T\underline{x} </math> <br /><br />
:<math>\log\left(\frac{P(Y=2|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_2}^T\underline{x} </math> <br /><br />
<math> \vdots</math><br />
:<math>\log\left(\frac{P(Y=k-1|X=x)}{P(Y=k|X=x)}\right)=\underline{\beta_{k-1}}^T\underline{x} </math> <br /><br />
<br />
Although in the above posterior ratios, the denominator is chosen to be the posterior of the last class (class k), the choice of denominator is arbitrary in that the posterior estimates are equivariant under this choice - [http://www.springerlink.com/content/t45k620382733r71/ Linear Methods for Classification].<br /><br /><br />
<br />
Each of these functions is linear in <math>x</math>; however, each has a different <math>\beta</math>. We have to make sure that the posterior probabilities assigned to the different classes sum to one.<br /><br /><br />
<br />
In general, we can write:<br />
<br /><math>P(Y=c | X=x) = \frac{e^{\underline{\beta_c}^T \underline{x}}}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}},\quad c \in \{1,\dots,k-1\} </math><br /><br />
<br /><math>P(Y=k | X=x) = \frac{1}{1+\sum_{l=1}^{k-1}e^{\underline{\beta_l}^T \underline{x}}}</math><br /><br />
These posteriors clearly sum to one. <br /><br /><br />
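<br />
The two formulas above can be evaluated directly. The following is a small sketch, not an estimation procedure: the coefficient vectors in <code>betas</code> and the input <code>x</code> are made-up numbers, and <code>x</code> is assumed to carry a leading 1 for the intercept.<br />
<br />
```python
import math

def multiclass_posteriors(betas, x):
    """Return [P(Y=1|x), ..., P(Y=k|x)] given k-1 coefficient vectors.

    Class k is the reference class, so it has no coefficient vector.
    """
    exps = [math.exp(sum(b * v for b, v in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps] + [1.0 / denom]

# Hypothetical example with k = 3 classes and x = (1, x1, x2).
betas = [[0.5, 1.0, -1.0], [-0.2, 0.3, 0.8]]
x = [1.0, 2.0, 0.5]
probs = multiclass_posteriors(betas, x)
print(probs, sum(probs))
```
<br />
Taking the logarithm of <code>probs[0] / probs[2]</code> recovers the linear function for class 1, which here is 0.5 + 1.0*2.0 - 1.0*0.5.<br />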
<br />
In the two-class case, finding the parameter <math>\beta</math> is fairly simple (the <math>\beta</math> in two-class logistic regression has dimension <math>(d+1)\times1</math>), as mentioned in previous lectures. In the multi-class case the iterative Newton method can still be used, but here <math>\beta</math> is of size <math>(d+1)\times(k-1)</math> and the weight matrix W is dense and non-diagonal. The resulting algorithm is computationally inefficient, though still feasible. A trick is to re-parametrize the logistic regression problem by expanding the input vector <math>x</math> (Question 4 in Assignment 2).<br />
<br /><br /><br />
<br />
===Neural Network Concept===<br />
The concept of an artificial neural network comes from scientists who wanted to simulate the human neural network on computers. They were trying to create computer programs that can learn like people. A neural network is a method in artificial intelligence that is a simplified model of neural processing in the brain, even though the relation between this model and the biological architecture of the brain is not yet clear.<br />
<br />
=== Perceptron ===<br />
<br />
The [http://en.wikipedia.org/wiki/Perceptron Perceptron], invented in 1957 by [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt], is the basic building block of feedforward neural networks.<br /><br /><br />
<br />
We know that the least-squares regression of a -1/1 response variable <math>\displaystyle Y</math> on the observations <math>\displaystyle x</math> leads to the same coefficients as LDA. Recall that LDA minimizes the distance between the discriminant function (decision boundary) and the data points. The least-squares classifier returns the sign of the linear combination of features as the class label (Figure 2). This was called the perceptron in the engineering literature during the 1950s. <br /><br /><br />
<br />
[[File:Perceptron.jpg|371px|thumb|right| Fig.2 Diagram of a linear perceptron ]]<br />
<br />
There is a cost function <math>\displaystyle D</math> that perceptron tries to minimize:<br /><br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math><br /><br />
<br />
where <math>\displaystyle M</math> is a set of misclassified points. <br /><br />
<br />
This is basically minimizing the sum of distances between the misclassified points and the decision boundary.<br /><br /><br />
<br />
'''Derivation''':'' The distances between the misclassified points and the decision boundary''.<br /><br /><br />
<br />
Consider points <math>\underline{x_1}</math>, <math>\underline{x_2}</math> and a decision boundary defined as <math>\underline{\beta}^T\underline{x}+\beta_0=0</math>, as shown in Figure 3.<br /><br />
<br />
[[File:DB.jpg|248px|thumb|right| Fig.3 Distance from the decision boundary ]]<br />
<br />
Since both <math>\underline{x_1}</math> and <math>\underline{x_2}</math> lie on the decision boundary, we have:<br /><br />
<math>\underline{\beta}^T\underline{x_1}+\beta_0=0 \rightarrow (1)</math><br /><br />
<math>\underline{\beta}^T\underline{x_2}+\beta_0=0 \rightarrow (2)</math><br /><br />
<br />
From (1) and (2):<br /><br />
<math>\underline{\beta}^T(\underline{x_2}-\underline{x_1})=0</math><br /><br />
<br />
Therefore, <math>\displaystyle \underline{\beta}</math> is orthogonal to <math>\underline{x_2}-\underline{x_1}</math>, which lies along the decision boundary. This means that <math>\displaystyle \underline{\beta}</math> is orthogonal to the decision boundary. <br /><br />
<br />
Then the signed distance of a point <math>\underline{x_0}</math> from the decision boundary is, up to the constant factor <math>1/\|\underline{\beta}\|</math>: <br /><br />
<br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})</math><br /><br />
<br />
From (2): <br /><br />
<br />
<math>\underline{\beta}^T\underline{x_2}= -\beta_0</math>. <br /><br />
<math>\underline{\beta}^T(\underline{x_0}-\underline{x_2})=\underline{\beta}^T\underline{x_0}-\underline{\beta}^T\underline{x_2}=\underline{\beta}^T\underline{x_0}+\beta_0</math><br /><br />
<br />
Therefore, the signed distance from any point <math>\underline{x_{i}}</math> to the discriminant hyperplane is proportional to <math>\underline{\beta}^T\underline{x_{i}}+\beta_0</math>.<br /><br /><br />
<br />
However, this quantity is not always positive. Consider <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math>: if <math>\underline{x_{i}}</math> is classified ''correctly'', then this product is positive, since <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> are either both positive or both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_0)</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative; hence, the product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> is negative for a misclassified point. The "-" sign in <math>D(\underline{\beta},\beta_0)</math> therefore makes this cost function non-negative. <br /><br /><br />
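<br />
As a concrete check of the sign argument, the cost <math>D(\underline{\beta},\beta_0)</math> can be computed by summing only over points whose product <math>y_{i}(\underline{\beta}^T \underline{x_{i}}+\beta_0)</math> is negative. This is a sketch with made-up data; the name <code>perceptron_cost</code> and the three example points are assumptions for illustration.<br />
<br />
```python
def perceptron_cost(beta, beta0, X, y):
    """D(beta, beta0) = - sum over misclassified i of y_i * (beta^T x_i + beta0)."""
    cost = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(b * v for b, v in zip(beta, x_i)) + beta0)
        if margin < 0:        # negative product: x_i is misclassified
            cost -= margin    # the minus sign makes the contribution positive
    return cost

# Hypothetical example: boundary x1 = 0, labels in {-1, +1}.
beta, beta0 = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [-1.0, 0.0], [-3.0, 0.0]]
y = [1, 1, -1]
print(perceptron_cost(beta, beta0, X, y))
```
<br />
Only the second point lies on the wrong side of the boundary, so it is the only one that contributes to the cost.<br />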
<br />
==Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010 ==<br />
===Lecture Summary===<br />
In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on gradient descent. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network (FFNN). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation. <br />
<br />
To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.<br />
<br />
===Perceptron===<br />
The last lecture introduced the Perceptron and showed how it can suggest a solution for the 2-class classification problem. We saw that the solution requires minimization of a cost function, which is basically a summation of the distances of the misclassified data points to the separating hyperplane. This cost function is<br />
<br />
<math>D(\underline{\beta},\beta_0)=-\sum_{i \in M}y_{i}(\underline{\beta}^T \underline{x}_i+\beta_0),</math><br />
<br />
in which <math>\,M</math> is the set of misclassified points. Thus, the objective is to find <math>\arg\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>.<br />
<br />
====Perceptron Learning Algorithm====<br />
To minimize <math>D(\underline{\beta},\beta_0)</math>, an algorithm that uses gradient descent has been suggested. Gradient descent, also known as steepest descent, is a numerical optimization technique that starts from an initial value for <math>(\underline{\beta},\beta_0)</math> and iteratively approaches an optimal solution. Each iteration updates <math>(\underline{\beta},\beta_0)</math> by subtracting from it a multiple of the gradient of <math>D(\underline{\beta},\beta_0)</math>. Mathematically, this gradient is<br />
<br />
<math>\nabla D(\underline{\beta},\beta_0)<br />
= \left( \begin{array}{c}\cfrac{\partial D}{\partial \underline{\beta}} \\ \\ <br />
\cfrac{\partial D}{\partial \beta_0} \end{array} \right)<br />
= \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}\underline{x}_i \\ <br />
-\displaystyle\sum_{i \in M}y_{i} \end{array} \right)</math><br />
<br />
However, the perceptron learning algorithm does not use the sum of the contributions from all observations to calculate the gradient at each step. Instead, each step uses the gradient contribution of only a single observation, and each successive step uses a different observation. This modification is called stochastic gradient descent. That is, instead of subtracting some multiple of <math>\nabla D(\underline{\beta},\beta_0)</math> at each step, we subtract a positive multiple of the single-observation gradient<br />
<br />
<math>\left( \begin{array}{c} -y_{i}\underline{x}_i \\ <br />
-y_{i} \end{array} \right),</math><br />
<br />
which amounts to adding a positive multiple of <math>(y_{i}\underline{x}_i,\; y_{i})</math> to the current parameters.<br />
<br />
As a result, the pseudo code for the Perceptron Learning Algorithm is as follows:<br />
<br />
:1) Choose a random initial value for <math>(\underline{\beta},\beta_0)</math>.<br />
<br />
:2) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^0\\<br />
\beta_0^0<br />
\end{pmatrix}</math><br />
<br />
:3) <math>\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{new}}\\<br />
\beta_0^{\mathrm{new}}<br />
\end{pmatrix}<br />
\leftarrow<br />
\begin{pmatrix}<br />
\underline{\beta}^{\mathrm{old}}\\<br />
\beta_0^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix}<br />
y_i \underline{x_i}\\<br />
y_i<br />
\end{pmatrix}</math> for some <math>\,i \in M</math>.<br />
<br />
:4) If the termination criterion has not been met, go back to step 3 and use a different observation datapoint (i.e. a different <math>\,i</math>).<br />
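<br />
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: labels are taken to be -1/+1, a point with a non-positive product is treated as misclassified, the observation for each update is drawn at random from <math>\,M</math>, and the cap <code>max_updates</code> is an added safeguard. All names and the toy data are hypothetical.<br />
<br />
```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_perceptron(X, y, rho=1.0, max_updates=1000, seed=0):
    rng = random.Random(seed)
    # Steps 1 and 2: random initial values for (beta, beta0).
    beta = [rng.uniform(-1.0, 1.0) for _ in X[0]]
    beta0 = rng.uniform(-1.0, 1.0)
    for _ in range(max_updates):
        # M: indices of the currently misclassified points.
        M = [i for i in range(len(X)) if y[i] * (dot(beta, X[i]) + beta0) <= 0]
        if not M:                      # termination: no misclassified points left
            break
        i = rng.choice(M)              # step 4: use a (possibly) different observation
        # Step 3: add rho * (y_i * x_i, y_i) to (beta, beta0).
        beta = [b + rho * y[i] * v for b, v in zip(beta, X[i])]
        beta0 += rho * y[i]
    return beta, beta0

# Linearly separable toy data (hypothetical).
X = [[2.0, 1.0], [3.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [-1.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
beta, beta0 = train_perceptron(X, y)
print(beta, beta0)
```
<br />
Because these toy classes are linearly separable, the loop stops as soon as every product <math>y_{i}(\underline{\beta}^T \underline{x}_i+\beta_0)</math> is positive.<br />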
<br />
The learning rate <math>\,\rho</math> controls the step size of convergence toward <math>\min_{\underline{\beta},\beta_0} D(\underline{\beta},\beta_0)</math>. A larger value of <math>\,\rho</math> causes the steps to be larger. If <math>\,\rho</math> is set too large, however, the minimum could be missed (over-stepped).<br />
In practice, <math>\rho</math> can be adaptive rather than fixed, meaning that <math>\rho</math> can be larger in the first steps than in the last ones. At the beginning, a larger <math>\rho</math> helps to find an approximate answer sooner, and a smaller <math>\rho</math> in the last steps helps to tune the final answer more accurately. <br />
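<br />
One simple way to obtain a step size that is larger in the first steps and smaller in the last ones is a decaying schedule. The particular formula below, <math>\rho_t = \rho_0/(1+ct)</math>, and the names <code>rho0</code> and <code>decay</code> are assumptions chosen for illustration; the lecture does not prescribe a specific schedule.<br />
<br />
```python
def learning_rate(t, rho0=1.0, decay=0.1):
    """Step size for update number t (t = 0, 1, 2, ...): large early, smaller later."""
    return rho0 / (1.0 + decay * t)

schedule = [learning_rate(t) for t in range(0, 50, 10)]
print(schedule)
```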
<br />
<br />
As mentioned earlier, the learning algorithm uses just one of the data points at each iteration; this is the common practice when dealing with online applications. In an online application, datapoints are accessed one-at-a-time because training data is not available in batch form. The learning algorithm does not require the derivative of the cost function with respect to the previously seen points; instead, we just have to take into consideration the effect of each new point.<br />
<br />
One way that the algorithm could terminate is if there are no more misclassified points (i.e. if the set <math>\,M</math> is empty). As long as there are points in <math>\,M</math>, the algorithm continues until some other termination criterion is reached. The termination criterion for an optimization algorithm is usually convergence, but for numerical methods this is not well-defined. In theory, convergence is realized when the gradient of the cost function is zero; in numerical methods, a gradient that is zero to within some margin of error is accepted instead.<br />
<br />
If the data are linearly separable, the algorithm is theoretically guaranteed to converge in a finite number of iterations. This number of iterations depends on the <br />
<br />
* learning rate <math>\,\rho</math><br />
<br />
* initial value <math>(\underline{\beta},\beta_0)</math><br />
<br />
* difficulty of the problem. The problem is more difficult if the gap between the classes of data is very small.<br />
<br />
Note that we consider the offset term <math>\beta_0</math> separately from the <math>\underline{\beta}</math> to distinguish this formulation from those in which the direction of the hyperplane (<math>\underline{\beta}</math>) has been considered.<br />
<br />
In general, a major concern with gradient descent is that it may get trapped in locally optimal solutions.<br />
<br />
====Some notes on the Perceptron Learning Algorithm====<br />
<br />
* If the training data points are available in batch form, it is better to take advantage of a closed optimization technique such as least squares or maximum-likelihood estimation for linear classifiers. (These techniques had been around for many years before the invention of the Perceptron.)<br />
<br />
* Just like the linear classifier, a Perceptron can discriminate between only two classes at a time, and one can generalize its performance for multi-class problems by using one of the <math>k-1</math>, <math>k</math>, or <math>k(k-1)/2</math>-hyperplane methods.<br />
<br />
* If the two classes are linearly separable, the algorithm will converge in a finite number of iterations to a hyperplane, which makes the error of training data zero. The convergence is guaranteed if the learning rate is set adequately.<br />
<br />
* If the two classes are not linearly separable, the algorithm will never converge. In such cases one needs another termination criterion (e.g. a maximum number of iterations within which convergence is expected, or the rate of change of both the cost function and its derivative).<br />
<br />
* In the case of linearly separable classes, the final solution and number of iterations will be dependent on the initial conditions, learning rate, and the gap between the two classes. In general, a smaller gap between classes requires a greater number of iterations for the algorithm to converge.<br />
<br />
* The learning rate (or update step) has a direct impact on both the number of iterations and the accuracy of the solution of the optimization problem. Smaller values make convergence slower but lead to a more accurate solution; larger values make the process faster at the cost of some precision. One must therefore balance this trade-off in order to reach an accurate enough solution fast enough. (exploration vs. exploitation)<br />
<br />
In the upcoming lectures, we introduce Support Vector Machines (SVM), which use an iterative optimization scheme similar to that of the Perceptron but with a different definition of the cost function.<br />
<br />
===Universal Function Approximator===<br />
The universal function approximator is a mathematical formulation for a group of estimation techniques. The usual formulation for it is<br />
<br />
<math>\hat{Y}(x)=\sum\limits_{i=1}^{n}\alpha_i\sigma(\omega_i^Tx+b_i),</math><br />
<br />
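where <math>\hat{Y}(x)</math> is an estimate of a function <math>\,Y(x)</math>. According to the universal approximation theorem, for a continuous function <math>\,Y</math> on a bounded domain and any <math>\,\epsilon > 0</math>, there exist a sufficiently large <math>\,n</math> and parameters <math>\,\alpha_i, \omega_i, b_i</math> such that<br />
<br />
<math>|\hat{Y}(x) - Y(x)|<\epsilon,</math><br />
<br />
which means that <math>\hat{Y}(x)</math> can be made as close to <math>\,Y(x)</math> as necessary.<br />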
<br />
This formulation assumes that the output, <math>\,Y(x)</math>, is a linear combination of a set of functions like <math>\,\sigma(.)</math> where <math>\,\sigma(.)</math> is a nonlinear function of the inputs or <math>\,x_i</math>s.<br />
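<br />
To make the formulation concrete, the sum can be evaluated directly for given parameters. The sketch below uses hand-picked (not fitted) values with <math>\,n=2</math>: two steep logistic functions whose difference is close to 1 on the interval (0.3, 0.7) and close to 0 elsewhere. The function name <code>approximator</code> and all numbers are assumptions made for this example.<br />
<br />
```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def approximator(x, alphas, omegas, biases):
    """Evaluate Y_hat(x) = sum_i alpha_i * sigma(omega_i^T x + b_i)."""
    total = 0.0
    for alpha, omega, b in zip(alphas, omegas, biases):
        total += alpha * sigmoid(sum(w * v for w, v in zip(omega, x)) + b)
    return total

# Hypothetical parameters for a one-dimensional input.
alphas = [1.0, -1.0]
omegas = [[50.0], [50.0]]
biases = [-15.0, -35.0]
for x in (0.0, 0.5, 1.0):
    print(x, approximator([x], alphas, omegas, biases))
```
<br />
Adding more terms of this kind allows finer "bumps", which gives some intuition for why a large enough <math>\,n</math> can follow a target function closely.<br />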
<br />
====Generalization Factors====<br />
Even though this formulation represents a universal function approximator, which means that it can be fitted to a set of data as closely as demanded, the closeness of fit must be carefully decided upon. In many cases, the purpose of the model is to target unseen data. However, the fit to this unseen data is impossible to determine before it arrives.<br />
<br />
To overcome this dilemma, a common practice is to divide the available data points into two sets: training data and validation data. We use the training data to estimate the fixed parameters of the model, and then use the validation data to find values for the construction-dependent parameters. What these construction-dependent parameters are depends on the model. In the case of a polynomial, the construction-dependent parameter would be its highest degree, and for a neural network, the construction-dependent parameters could be the number of hidden layers and the number of neurons in each layer.<br />
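<br />
The training/validation procedure can be sketched for the polynomial case with just two candidate degrees, 0 and 1, both of which have simple closed-form least-squares fits. The data, the even/odd split, and all function names here are assumptions made for illustration.<br />
<br />
```python
def fit_constant(xs, ys):
    """Degree-0 polynomial: predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Degree-1 polynomial fitted by ordinary least squares."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_degree(train, val):
    """Fit each candidate on the training data, score it on the validation data."""
    fitters = {0: fit_constant, 1: fit_line}
    errors = {deg: mse(fit(*train), *val) for deg, fit in fitters.items()}
    return min(errors, key=errors.get), errors

# Hypothetical data generated from a straight line.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
train = (xs[0::2], ys[0::2])   # even-indexed points: estimate the coefficients
val = (xs[1::2], ys[1::2])     # odd-indexed points: choose the degree
best, errors = select_degree(train, val)
print(best, errors)
```
<br />
The coefficients are estimated only from <code>train</code>, while the degree is chosen only from the error on <code>val</code>.<br />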
<br />
These matters of model generalization vs. complexity will be discussed in more detail in the lectures to follow.<br />
<br />
===Feed-Forward Neural Network===<br />
The Neural Network (NN) is one application of the universal function approximator. It can be thought of as a system of Perceptrons linked together as units of a network. One particular NN useful for classification is the Feed-Forward Neural Network (FFNN), which consists of multiple "hidden layers" of Perceptron units. Our discussion here is based around the FFNN, which has the topology shown in Figure 1. The first hidden layer of units receives input from the original features. Between the hidden layers, connections from each unit are always directed to units in the next adjacent layer. In the output layer, which receives input only from the last hidden layer, each unit produces a target measurement for a distinct class (i.e. <math>\,K</math> classes require <math>\,K</math> units). In Figure 1, the units in a single layer are distributed vertically, and the inputs and outputs of the network are shown as the far left and right layers respectively.<br />
<br />
[[File:FFNN.png|300px|thumb|right|Fig.1 A common architecture for the FFNN]]<br />
<br />
====Mathematical Model of the FFNN with One Hidden Layer====<br />
The FFNN with one hidden layer for a <math>\,K</math>-class problem is defined as follows. Let <math>\,d</math> be the number of input features, <math>\,p</math> be the number of units in the hidden layer, and <math>\,K</math> be the number of classes (i.e. the number of units in the output layer).<br />
<br />
Each neural unit calculates its derived feature (i.e. output) using a linear combination of its inputs. Suppose <math>\,\underline{x}</math> is the <math>\,d</math>-dimensional vector of input features. Then, each neural unit uses a <math>\,d</math>-dimensional vector of weights to combine these input features: for the <math>\,i</math>th neural unit, let <math>\underline{u}_i</math> be this vector of weights. The linear combination calculated by the <math>\,i</math>th unit is then given by<br />
<br />
<math>a_i = \underline{u}_i^T\underline{x}</math><br />
<br />
However, we want the derived feature to lie between 0 and 1, so we apply an ''activation function'' <math>\,\sigma(a)</math>. The derived feature for the <math>\,i</math>th unit is then given by<br />
<br />
<math>\,z_i = \sigma(a_i)</math> where <math>\,\sigma</math> is typically the logistic function<br />
<br />
<math>\sigma(a) = \cfrac{1}{1+e^{-a}}</math><br />
<br />
Now, we place each of the derived features <math>\,z_i</math> from the hidden layer into a <math>\,p</math>-dimensional vector:<br />
<br />
<math>\underline{z} = \left[ \begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_p \end{array}\right]</math><br />
<br />
Like in the hidden layer, each unit in the output layer calculates its derived feature using a linear combination of its inputs. Each neural unit uses a <math>\,p</math>-dimensional vector of weights to combine the input features derived from the hidden layer. Let <math>\,\underline{w}_k</math> be this vector of weights used in the <math>\,k</math>th unit. The linear combination calculated by the <math>\,k</math>th unit is then given by<br />
<br />
<math>\hat{y}_k = \underline{w}_k^T\underline{z}</math><br />
<br />
<math>\,\hat{y}_k</math> is thus the estimated target measurement for the <math>\,k</math>th class. Note that an activation function <math>\,\sigma</math> is not used here.<br />
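<br />
The model just described amounts to a short forward computation. In the sketch below, <code>U</code> holds the hidden-layer weight vectors <math>\underline{u}_i</math> as rows and <code>W</code> holds the output-layer weight vectors <math>\underline{w}_k</math> as rows; the weight values are arbitrary illustrative numbers, not trained ones.<br />
<br />
```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward(x, U, W):
    """One-hidden-layer FFNN: returns (a, z, y_hat)."""
    a = [dot(u_i, x) for u_i in U]          # a_i = u_i^T x
    z = [sigmoid(a_i) for a_i in a]         # z_i = sigma(a_i)
    y_hat = [dot(w_k, z) for w_k in W]      # y_hat_k = w_k^T z (no activation)
    return a, z, y_hat

# Hypothetical network: d = 2 inputs, p = 2 hidden units, K = 2 outputs.
x = [1.0, -2.0]
U = [[0.0, 0.0], [0.0, 0.0]]
W = [[1.0, 1.0], [2.0, -2.0]]
a, z, y_hat = forward(x, U, W)
print(a, z, y_hat)
```
<br />
With all hidden weights set to zero every hidden unit outputs <math>\,\sigma(0)=0.5</math>, so the outputs depend only on <code>W</code>.<br />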
<br />
Notice that in each of the hidden units, two operations take place:<br />
<br />
* a linear combination of the neuron's inputs is calculated using corresponding weights<br />
<br />
* a nonlinear operation on the linear combination is performed. <br />
<br />
These two calculations are shown in Figure 2. <br />
<br />
The nonlinear function <math>\,\sigma(.)</math> is called the activation function. Activation functions, like the logistic function shown earlier, are usually continuous and have a limited range. Another common activation function used in neural networks is <math>\,tanh(x)</math> (Figure 3).<br />
<br />
[[File:neuron2.png|300px|thumb|right|Fig.2 A general construction for a single neuron]]<br />
[[File:actfcn.png|300px|thumb|right|Fig.3 <math>tanh</math> as activation function]]<br />
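<br />
Gradient-based training needs the derivative of the activation function. Both activation functions mentioned here have derivatives that can be written in terms of the function value itself: <math>\,\sigma'(a)=\sigma(a)(1-\sigma(a))</math> and <math>\,tanh'(a)=1-tanh^2(a)=1/cosh^2(a)</math>. The sketch below compares these formulas with a central finite difference; the helper names and the step size <code>h</code> are illustrative assumptions.<br />
<br />
```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def tanh_prime(a):
    return 1.0 - math.tanh(a) ** 2      # equals 1 / cosh(a)^2

def numeric_derivative(f, a, h=1e-5):
    """Central finite-difference approximation of f'(a)."""
    return (f(a + h) - f(a - h)) / (2.0 * h)

for a in (-2.0, 0.0, 1.5):
    print(a,
          sigmoid_prime(a), numeric_derivative(sigmoid, a),
          tanh_prime(a), numeric_derivative(math.tanh, a))
```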
<br />
The NN can be applied as a regression method or as a classifier, and the output layer differs depending on the application. The major difference between regression and classification is in the output space of the model, which is continuous in the case of regression, and discrete in the case of classification. For a regression task, no consideration is needed beyond what has already been mentioned earlier, since the outputs of the network would already be continuous. However, to use the neural network as a classifier, a threshold stage is necessary.<br />
<br />
====Mathematical Model of the FFNN with Multiple Hidden Layers====<br />
In the FFNN model with a single hidden layer, the derived features were represented as elements of the vector <math>\underline{z}</math>, and the original features were represented as elements of the vector <math>\underline{x}</math>. In the FFNN model with more than one hidden layer, <math>\underline{z}</math> is processed by the second hidden layer in the same way that <math>\underline{x}</math> was processed by the first hidden layer. Perceptrons in the second layer each use their own combination of weights to calculate a new set of derived features. These new derived features are processed by the third hidden layer in a similar way, and the cycle repeats for each additional hidden layer. This progression of processing is depicted in Figure 4.<br />
<br />
====Back-Propagation Learning Algorithm====<br />
<br />
[[File:bpl.png|300px|thumb|right|Fig.4 Labels for weights and derived features in the FFNN.]]<br />
<br />
Every linear-combination calculation in the FFNN involves weights that need to be set, and these weights are set using training data and an algorithm called Back-Propagation. This algorithm is similar to the gradient-descent algorithm introduced in the discussion of the Perceptron. The primary difference is that the gradient used in Back-Propagation is calculated in a more complicated way.<br />
<br />
First of all, we want to minimize the error between the estimated and true target measurements for the training data. That is, if <math>\,U</math> is the set of all weights in the FFNN, then we want to determine<br />
<br />
<math>\arg\min_U \left|y - \hat{y}\right|^2</math><br />
<br />
Now, suppose the hidden layers of the FFNN are labelled as in Figure 4. Then, we want to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the hidden layers of the FFNN. For weights <math>\,u_{jl}</math> this means we will need to find<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}}<br />
= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot<br />
\cfrac{\partial a_j}{\partial u_{jl}} = \delta_{j}z_l<br />
</math><br />
<br />
However, the closed-form solution for <math>\,\delta_{j}</math> is unknown, so we develop a recursive definition (<math>\,\delta_{j}</math> in terms of <math>\,\delta_{i}</math>):<br />
<br />
<math><br />
\delta_j = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j} <br />
= \sum_{i=1}^p \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot<br />
\cfrac{\partial a_i}{\partial a_j} <br />
= \sum_{i=1}^p \delta_i\cdot u_{ij} \cdot \sigma'(a_j)<br />
= \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}<br />
</math><br />
<br />
We also need to determine the derivative of <math>\left|y - \hat{y}\right|^2</math> with respect to each weight in the ''output layer'' <math>\,k</math> of the FFNN (this layer is not shown in Figure 4, but it would be the next layer to the right of the rightmost layer shown). For weights <math>\,u_{ki}</math> this means<br />
<br />
<math><br />
\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{ki}}<br />
= \cfrac{\partial \left|y - \sum_m u_{km}z_m\right|^2}{\partial u_{ki}}<br />
= -2(y - \sum_m u_{km}z_m)z_i<br />
= -2(y - \hat{y})z_i<br />
</math><br />
<br />
With similarity to our computation of <math>\,\delta_j</math>, we define<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}</math><br />
<br />
However, <math>\,a_k = \hat{y}</math> because an activation function is not applied in the output layer. So, our calculation becomes<br />
<br />
<math>\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}}<br />
= -2(y - \hat{y})</math><br />
<br />
Now that we have <math>\,\delta_k</math> and a recursive definition for <math>\,\delta_j</math>, it is clear that the derivatives for all weights can be computed by starting from the output layer and working back through the hidden layers toward the input layer.<br />
<br />
Based on the above derivation, our algorithm for determining weights in the FFNN is as follows<br />
<br />
:1) Choose random initial weights.<br />
<br />
:2) Apply a new datapoint <math>\underline{x}</math> to the FFNN as the input layer, and calculate the values for all units.<br />
<br />
:3) Compute <math>\,\delta_k = -2(y_k - \hat{y}_k)</math>.<br />
<br />
:4) Back-propagate layer-by-layer by computing <math>\delta_j = \sigma'(a_j)\sum_{i=1}^p \delta_i \cdot u_{ij}</math> for all units.<br />
<br />
:5) Compute <math>\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} = \delta_{j}z_l</math> for all weights <math>\,u_{jl}</math>.<br />
<br />
:6) Update <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}}<br />
- \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{jl}} </math> for all weights <math>\,u_{jl}</math> where <math>\,\rho</math> is the learning rate.<br />
<br />
:7) If the termination criterion has not been met, go back to step 2 and apply another datapoint (a full pass through all of the training datapoints is called an "epoch").<br />
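<br />
The steps above can be sketched for a network with a single hidden layer, logistic hidden units, and linear output units. This is a minimal illustration under those assumptions: <code>U</code> and <code>W</code> are the hidden and output weight matrices (one row per unit), the starting weights are fixed small numbers rather than random ones, and only one update on one datapoint is shown.<br />
<br />
```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, U, W):
    """Step 2: hidden outputs z and network outputs y_hat for input x."""
    z = [sigmoid(sum(u * v for u, v in zip(u_j, x))) for u_j in U]
    y_hat = [sum(w * v for w, v in zip(w_k, z)) for w_k in W]
    return z, y_hat

def loss(x, y, U, W):
    _, y_hat = forward(x, U, W)
    return sum((t - o) ** 2 for t, o in zip(y, y_hat))

def backprop(x, y, U, W):
    """Gradients of the squared error with respect to U and W."""
    z, y_hat = forward(x, U, W)
    # Step 3: delta_k = -2 (y_k - y_hat_k) for the linear output units.
    delta_out = [-2.0 * (t - o) for t, o in zip(y, y_hat)]
    # Step 4: delta_j = sigma'(a_j) * sum_k delta_k * w_kj, with sigma'(a_j) = z_j (1 - z_j).
    delta_hid = [z_j * (1.0 - z_j) * sum(d_k * w_k[j] for d_k, w_k in zip(delta_out, W))
                 for j, z_j in enumerate(z)]
    # Step 5: derivative for each weight = delta of its unit times the unit's input.
    grad_W = [[d_k * z_j for z_j in z] for d_k in delta_out]
    grad_U = [[d_j * x_l for x_l in x] for d_j in delta_hid]
    return grad_U, grad_W

def sgd_step(x, y, U, W, rho=0.05):
    """Step 6: move every weight against its derivative."""
    grad_U, grad_W = backprop(x, y, U, W)
    U_new = [[u - rho * g for u, g in zip(u_r, g_r)] for u_r, g_r in zip(U, grad_U)]
    W_new = [[w - rho * g for w, g in zip(w_r, g_r)] for w_r, g_r in zip(W, grad_W)]
    return U_new, W_new

# Hypothetical datapoint and starting weights.
x, y = [1.0, 0.5], [1.0]
U = [[0.1, -0.2], [0.3, 0.4]]
W = [[0.5, -0.5]]
before = loss(x, y, U, W)
U, W = sgd_step(x, y, U, W)
after = loss(x, y, U, W)
print(before, after)
```
<br />
A useful sanity check for any implementation of this kind is to compare the computed derivatives with finite differences of the loss.<br />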
<br />
====Alternative Description of the Back-Propagation Algorithm====<br />
Label the inputs and outputs of the <math>\,i</math>th hidden layer <math>\underline{x}_i</math> and <math>\underline{y}_i</math> respectively, and let <math>\,\sigma(.)</math> be the activation function for all neurons. We now have<br />
<br />
<math>\begin{align}<br />
\begin{cases}<br />
\underline{y}_1=\sigma(W_1.\underline{x}_1),\\<br />
\underline{y}_2=\sigma(W_2.\underline{x}_2),\\<br />
\underline{y}_3=\sigma(W_3.\underline{x}_3),<br />
\end{cases}<br />
\end{align}</math><br />
<br />
where <math>\,W_i</math> is the matrix of connection weights between layers <math>\,i</math> and <math>\,i+1</math>, and has <math>\,n_i</math> columns and <math>\,n_{i+1}</math> rows, where <math>\,n_i</math> is the number of neurons in the <math>\,i^{th}</math> layer.<br />
<br />
Considering these matrix equations, one can write a closed form for the derivative of the error with respect to the weights of the network. For a neural network with two hidden layers, the equations are as follows.<br />
<br />
<math>\begin{align}<br />
\frac{\partial E}{\partial W_3}=&diag(e).\sigma'(W_3.\underline{x}_3).(\underline{x}_3)^T,\\<br />
\frac{\partial E}{\partial W_2}=&\sigma'(W_2.\underline{x}_2).(\underline{x}_2)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3\}\},\\<br />
\frac{\partial E}{\partial W_1}=&\sigma'(W_1.\underline{x}_1).(\underline{x}_1)^T.diag\{\sum rows\{diag(e).diag(\sigma'(W_3.\underline{x}_3)).W_3.diag(\sigma'(W_2.\underline{x}_2)).W_2\}\},<br />
\end{align}</math><br />
<br />
where <math>\,\sigma'(.)</math> is the derivative of the activation function <math>\,\sigma(.)</math>.<br />
<br />
Using this closed-form derivative, it is possible to code the procedure for any number of layers and neurons. The following MATLAB code implements the back-propagation algorithm (<math>\,tanh</math> is used as the activation function).<br />
<br />
<br />
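 % Assumed to be defined beforehand: ep (number of epochs), Q (number of samples),<br />
 % n (units per layer), l (number of weight layers), f (number of features),<br />
 % dd (number of rows of data), ro (learning rate), w (weights), E = zeros(1,ep), i = 0.<br />
 % shuffle is assumed to be a helper that randomly permutes data along dimension 2 (its columns).<br />
 while i < ep<br />
     i = i + 1;<br />
     % new random sample order for this epoch<br />
     data = shuffle(data,2);<br />
     for j = 1:Q<br />
         % io: outputs of every layer (row 1 is the bias unit); gp: activation derivatives<br />
         io = zeros(max(n)+1,length(n));<br />
         gp = io;<br />
         io(1:n(1)+1,1) = [1;data(1:f,j)];<br />
         % forward pass through the l weight layers<br />
         for k = 1:l<br />
             io(2:n(k+1)+1,k+1) = w(2:n(k+1)+1,1:n(k)+1,k)*io(1:n(k)+1,k);<br />
             % derivative of tanh is 1/cosh^2<br />
             gp(1:n(k+1)+1,k) = [0;1./(cosh(io(2:n(k+1)+1,k+1))).^2];<br />
             io(1:n(k+1)+1,k+1) = [1;tanh(io(2:n(k+1)+1,k+1))];<br />
             % weights scaled by the activation derivative, used in the backward pass<br />
             wg(1:n(k+1)+1,1:n(k)+1,k) = diag(gp(1:n(k+1)+1,k))*w(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
         % output error (0 for the bias row): network output minus target<br />
         e = [0;io(2:n(l+1)+1,l+1) - data(f+1:dd,j)];<br />
         wg(1:n(l+1)+1,1:n(l)+1,l) = diag(e)*wg(1:n(l+1)+1,1:n(l)+1,l);<br />
         gp(1:n(l+1)+1,l) = diag(e)*gp(1:n(l+1)+1,l);<br />
         d = eye(n(l+1)+1);<br />
         % accumulate the squared error of epoch i<br />
         E(i) = E(i) + 0.5*norm(e)^2;<br />
         % backward pass: update layer k, then propagate the error one layer back<br />
         for k = l:-1:1<br />
             w(1:n(k+1)+1,1:n(k)+1,k) = w(1:n(k+1)+1,1:n(k)+1,k) - ro*diag(sum(d,1))*gp(1:n(k+1)+1,k)*(io(1:n(k)+1,k)');<br />
             d = d*wg(1:n(k+1)+1,1:n(k)+1,k);<br />
         end<br />
     end<br />
 end<br />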
<br />
====Some notes on the neural network and its learning algorithm====<br />
<br />
* Activation functions are usually approximately linear around the origin. If this is the case, choosing random initial weights between <math>\,-0.5</math> and <math>\,0.5</math> and normalizing the data may speed up the algorithm in the first steps of the procedure, as the linear combination of the inputs and weights then falls within the linear region of the activation function.<br />
<br />
* Learning of the neural network using backpropagation algorithm takes place in epochs. An Epoch is a single pass through the entire training set.<br />
<br />
* It is a common practice to randomly change the permutation of the training data in each one of the epochs, to make the learning independent of the data permutation.<br />
<br />
* Given a set of data for training a neural network, one should keep aside a ratio of it as the validation dataset, to obtain a sufficient number of layers and number of neurons in each of the layers. The best construction may be the one which leads to the least error for the validation dataset. Validation data may not be used as the training of the network.<br />
<br />
* We can also use the validation-training scheme to estimate how many epochs is enough for training the network.<br />
<br />
* It is also common to use other optimization algorithms as steepest descent and conjugate gradient in a batch form.<br />
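<br />
The notes on epochs, per-epoch shuffling, and a held-out validation set can be sketched as follows. This is only a minimal illustration in Python, not the Matlab procedure given earlier: the one-parameter linear model, the learning rate, the synthetic data, and the helper name <code>train_with_validation</code> are all assumptions made for the example.<br />

```python
import random

def train_with_validation(data, epochs=30, lr=0.05, val_ratio=0.25, seed=0):
    """Fit y ~ w*x by stochastic gradient descent; return (w, val_errors, best_epoch)."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    n_val = max(1, int(len(data) * val_ratio))
    val, train = data[:n_val], data[n_val:]   # validation data is never trained on

    w = rng.uniform(-0.5, 0.5)                # small random initial weight
    val_errors = []
    for _ in range(epochs):                   # one epoch = one pass over the training set
        rng.shuffle(train)                    # new permutation in every epoch
        for x, y in train:
            w -= lr * (w * x - y) * x         # gradient step on 0.5*(w*x - y)^2
        val_errors.append(sum(0.5 * (w * x - y) ** 2 for x, y in val) / len(val))

    # Epoch (counted from 1) with the smallest validation error
    best_epoch = min(range(epochs), key=val_errors.__getitem__) + 1
    return w, val_errors, best_epoch

# Hypothetical noisy data generated from y = 2x
gen = random.Random(1)
points = []
for _ in range(40):
    x = gen.uniform(-1, 1)
    points.append((x, 2 * x + gen.gauss(0, 0.1)))

w, val_errors, best_epoch = train_with_validation(points)
print(w, best_epoch)
```

Tracking the validation error per epoch, rather than the training error, is what allows the number of epochs to be chosen without reusing the validation data for training.<br />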
<br />
=== Deep Neural Network ===<br />
Back-propagation may not work well in practice when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem in which the errors become difficult to estimate, so configuring a neural network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. A deep neural network training algorithm deals with the training of a neural network with a large number of layers.<br />
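<br />
The shrinking of <math>\,\delta</math> with depth can be illustrated numerically. The sketch below is in Python, and the chain of single-neuron tanh layers, the weight 0.5, and the input 1.0 are assumptions chosen only for illustration. It multiplies the factors <math>\,w\,\sigma'(z)</math> that back-propagation accumulates from layer to layer.<br />

```python
import math

def backprop_factor(depth, w=0.5, x=1.0):
    """Return |d(output)/d(input)| for a chain of `depth` one-neuron tanh layers."""
    a = x
    factor = 1.0
    for _ in range(depth):
        z = w * a
        a = math.tanh(z)
        factor *= w * (1.0 - a * a)   # chain rule: w * tanh'(z), with tanh'(z) = 1 - tanh(z)^2
    return abs(factor)

for depth in (1, 5, 10, 20):
    print(depth, backprop_factor(depth))
```

Because <math>\,|\sigma'(z)| \le 1</math> for tanh, the factor is bounded by <math>\,|w|^{depth}</math> in this toy chain, so with <math>\,|w| < 1</math> it decays at least geometrically as layers are added.<br />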
<br />
The approach to training a deep network is to first assume that the network has only two layers and train these two layers. After that we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize the energy function, which is inspired by the theory in atomic physics concerning the most stable condition; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.<br />
<br />
==== Difficulties of training deep architecture <ref>{{Cite journal | title = Exploring Strategies for Training Deep Neural Networks | url = http://jmlr.csail.mit.edu/papers/volume10/larochelle09a/larochelle09a.pdf | year = 2009 | journal = Journal of Machine Learning Research | page = 1-40 | volume = 10 | last1 = Larochelle | first1 = H. | last2 = Bengio | first2 = Y. | last3 = Louradour | first3 = J. | last4 = Lamblin | first4 = P. }}</ref> ====<br />
<br />
Given a particular task, a natural way to train a deep network is to frame it as an optimization<br />
problem by specifying a supervised cost function on the output layer with respect to the desired<br />
target and use a gradient-based optimization algorithm in order to adjust the weights and biases<br />
of the network so that its output has low cost on samples in the training set. Unfortunately, deep<br />
networks trained in that manner have generally been found to perform worse than neural networks<br />
with one or two hidden layers.<br />
<br />
We discuss two hypotheses that may explain this difficulty. The first one is that gradient descent<br />
can easily get stuck in poor local minima (Auer et al., 1996) or plateaus of the non-convex training<br />
criterion. The number and quality of these local minima and plateaus (Fukumizu and Amari, 2000)<br />
clearly also influence the chances for random initialization to be in the basin of attraction (via<br />
gradient descent) of a poor solution. It may be that with more layers, the number or the width<br />
of such poor basins increases. To reduce the difficulty, it has been suggested to train a neural<br />
network in a constructive manner in order to divide the hard optimization problem into several<br />
greedy but simpler ones, either by adding one neuron (e.g., see Fahlman and Lebiere, 1990) or one<br />
layer (e.g., see Lengellé and Denoeux, 1996) at a time. These two approaches have demonstrated to<br />
be very effective for learning particularly complex functions, such as a very non-linear classification<br />
problem in 2 dimensions. However, these are exceptionally hard problems, and for learning tasks<br />
usually found in practice, this approach commonly overfits.<br />
<br />
This observation leads to a second hypothesis. For high capacity and highly flexible deep networks,<br />
there actually exist many basins of attraction in their parameter space (i.e., yielding different<br />
solutions with gradient descent) that can give low training error but that can have very different generalization<br />
errors. So even when gradient descent is able to find a (possibly local) good minimum<br />
in terms of training error, there are no guarantees that the associated parameter configuration will<br />
provide good generalization. Of course, model selection (e.g., by cross-validation) will partly correct<br />
this issue, but if the number of good generalization configurations is very small in comparison<br />
to good training configurations, as seems to be the case in practice, then it is likely that the training<br />
procedure will not find any of them. But, as shown in the cited paper, it appears that the type of<br />
unsupervised initialization discussed here can help to select basins of attraction (for the supervised<br />
fine-tuning optimization phase) from which learning good solutions is easier both from the point of<br />
view of the training set and of a test set.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When neural networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". We now know that they are just logistic regression layers stacked on top of each other and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. Because of claims of this kind, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since its objective is not convex, we still face the problem of local minima, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, neural networks lack a strong learning theory to back up their "success", so it is hard to apply and adjust them wisely. As a result, they are not an active research area in machine learning, although NNs still have wide applications in engineering fields such as control.<br />
<br />
===Business Applications of Neural Networks===<br />
<br />
Neural networks are increasingly being used in real-world business applications and, in some cases, such as fraud detection, they have already become the method of choice. Their use for risk assessment is also growing and they have been employed to visualize complex databases for marketing segmentation. This method covers a wide range of business interests — from finance management, through forecasting, to production. The combination of statistical, neural and fuzzy methods now enables direct quantitative studies to be carried out without the need for rocket-science expertise.<br />
<br />
* On the Use of Neural Networks for Analysis Travel Preference Data <br />
* Extracting Rules Concerning Market Segmentation from Artificial Neural Networks <br />
* Characterization and Segmenting the Business-to-Consumer E-Commerce Market Using Neural Networks<br />
* A Neurofuzzy Model for Predicting Business Bankruptcy <br />
* Neural Networks for Analysis of Financial Statements <br />
* Developments in Accurate Consumer Risk Assessment Technology <br />
* Strategies for Exploiting Neural Networks in Retail Finance <br />
* Novel Techniques for Profiling and Fraud Detection in Mobile Telecommunications<br />
* Detecting Payment Card Fraud with Neural Networks<br />
* Money Laundering Detection with a Neural-Network <br />
* Utilizing Fuzzy Logic and Neurofuzzy for Business Advantage</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=7148stat841f102010-10-19T05:39:24Z<p>Hclam: /* Linear Regression */</p>
<hr />
<div>==[[Proposal Fall 2010]] ==<br />
==[[statf10841Scribe|Editor sign up]] ==<br />
{{Cleanup|date=October 8 2010|reason=Provide a summary for each topic here.}}<br />
== Summary ==<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
=== Principal Component Analysis ===<br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest of the [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analyses]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA projects the data onto a user-defined number of the most important directions of variation (dimensions or '''principal components''') of the data, so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of the data whilst capturing as much of the variation in the data as possible.<br />
<br />
==[[f10_Stat841_digest |Digest ]] ==<br />
<br />
== ''' Reference Textbook''' ==<br />
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
== ''' Classification - September 21, 2010''' ==<br />
<br />
=== Classification ===<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginning of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples, who recognized which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimension_reduction dimensionality reduction] (feature extraction or manifold learning). Note that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in [http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm this link], clustering is simply a special case of classification and it may be called '''unsupervised classification'''.<br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found [http://www.e-knowledge.ca/quotes.php?topic=Knowledge here].<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers <br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,ith</math> training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\, ith</math> training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which can take a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
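<br />
The empirical error rate is simply the fraction of training inputs that the rule misclassifies. A minimal Python sketch follows; the threshold rule <code>h</code>, the diameters, and the labels are hypothetical values invented for the example.<br />

```python
def empirical_error_rate(h, inputs, labels):
    """Fraction of training pairs (X_i, Y_i) with h(X_i) != Y_i."""
    if len(inputs) != len(labels) or not inputs:
        raise ValueError("inputs and labels must be non-empty and of equal length")
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(inputs)

# Hypothetical rule: call a fruit an orange when its diameter exceeds 7.5 cm
def h(diameter_cm):
    return "orange" if diameter_cm > 7.5 else "apple"

diameters = [6.8, 7.2, 8.1, 7.9, 7.0]
labels = ["apple", "apple", "orange", "apple", "orange"]
rate = empirical_error_rate(h, diameters, labels)
print(rate)  # two of the five inputs are misclassified, so the rate is 0.4
```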
<br />
<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify any new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
<br />
In practice, the empirical error rate is used to estimate the true error rate, whose value cannot be known exactly because the parameter values of the underlying process cannot be known but can only be estimated using available data. For a fixed classification rule <math>\,h</math>, the empirical error rate estimates the true error rate quite well in that, as mentioned [http://www.liebertonline.com/doi/pdf/10.1089/106652703321825928 here], it is an unbiased estimator of the true error rate. When <math>\,h</math> has been trained on the same data, however, the empirical error rate tends to underestimate the true error rate.<br />
<br />
=== Bayes Classifier ===<br />
<br />
{{Cleanup|date=October 14 2010|reason=In response to the previous tag: The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function <math>\,f_y(x)</math> in the equation:<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
The simpler form of the likelihood function seen in the naive Bayes is:<br />
:<math><br />
\begin{align}<br />
f_y(x) = P(X=x|Y=y) = {\prod_{j=1}^{d} P(X_{j}=x_{j}|Y=y)}<br />
\end{align}<br />
</math><br />
The Bayes classifier taught in class was not the naive Bayes classifier. Perhaps a comment should be made about the naive Bayes classifier in the body of the text}}<br />
<br />
<br />
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". It is a special case of the Bayes classifier, which is described later in this section.<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for naive Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests][2].<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
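<br />
The point about needing only per-class means and variances can be made concrete with a small sketch. The Python code below assumes Gaussian features, which the text does not prescribe; the function names and the fruit measurements are likewise hypothetical. It stores one mean and one variance per feature and class, and no covariance matrix.<br />

```python
import math

def fit_naive_bayes(X, y):
    """Estimate, for each class, its prior and the mean and variance of every feature."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for column in zip(*rows):             # one column per feature
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var))
        model[c] = (len(rows) / len(X), stats)
    return model

def predict_naive_bayes(model, x):
    """Pick the class maximizing log prior + sum of per-feature Gaussian log-likelihoods."""
    def log_score(c):
        prior, stats = model[c]
        total = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            total += -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)
        return total
    return max(model, key=log_score)

# Hypothetical fruits described by (diameter in cm, weight in g)
X = [(7.0, 150.0), (7.4, 160.0), (6.8, 155.0), (8.0, 200.0), (8.3, 210.0), (7.9, 190.0)]
y = ["apple", "apple", "apple", "orange", "orange", "orange"]
model = fit_naive_bayes(X, y)
print(predict_naive_bayes(model, (7.1, 152.0)))
```

Summing log-likelihoods instead of multiplying probabilities is a design choice that avoids numerical underflow when there are many features. The sketch assumes every feature has non-zero variance within each class.<br />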
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class <math> y \in \mathcal{Y} </math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h^*</math>, and it is used by the Bayes classifier to assign any new data input a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math>, depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>. <br />
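<br />
The two-class computation of <math>\,r(x)</math> and the rule <math>\,h^*</math> can be sketched in a few lines of Python. The function names and the numerical likelihoods and prior below are hypothetical; in practice these quantities would come from the trained model.<br />

```python
def posterior_class1(lik1, lik0, prior1):
    """r(x) = P(Y=1|X=x) from P(X=x|Y=1), P(X=x|Y=0) and the prior P(Y=1)."""
    evidence = lik1 * prior1 + lik0 * (1.0 - prior1)   # P(X=x)
    if evidence == 0.0:
        raise ValueError("x has zero probability under both classes")
    return lik1 * prior1 / evidence

def bayes_rule(lik1, lik0, prior1):
    """h*(x): 1 if r(x) > 1/2, otherwise 0."""
    return 1 if posterior_class1(lik1, lik0, prior1) > 0.5 else 0

r = posterior_class1(lik1=0.6, lik0=0.2, prior1=0.3)
label = bayes_rule(lik1=0.6, lik0=0.2, prior1=0.3)
print(r, label)  # r = 0.18 / (0.18 + 0.14) = 0.5625, so the label is 1
```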
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification for any given new data input, i.e., for any generic classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. Estimating all of these probability functions, such as the likelihood, the prior probability, and the evidence, is also computationally very expensive, which makes some other classifiers more favorable than the Bayes classifier.<br />
<br />
A detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here].<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h^*</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h^*)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
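<br />
Approach 3 can be sketched as follows for a single real-valued feature. The use of one Gaussian per class as the density estimate is an assumption made for this example (the text does not fix the density estimator), and the function names and data are hypothetical.<br />

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_density_classifier(xs, ys):
    """Approach 3 for scalar x: one Gaussian per class, prior P(Y=1) = mean of the Y_i."""
    params = {}
    for c in (0, 1):
        pts = [x for x, y in zip(xs, ys) if y == c]
        mean = sum(pts) / len(pts)
        var = sum((p - mean) ** 2 for p in pts) / len(pts)   # maximum likelihood variance
        params[c] = (mean, var)
    prior1 = sum(ys) / len(ys)

    def r_hat(x):
        # Estimated posterior P(Y=1|X=x) via the Bayes formula
        num = gaussian_pdf(x, *params[1]) * prior1
        den = num + gaussian_pdf(x, *params[0]) * (1.0 - prior1)
        return num / den

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

xs = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
r_hat, h = fit_density_classifier(xs, ys)
print(h(2.0), h(4.5))
```

In this toy data set the two classes have equal estimated variances and equal priors, so the decision boundary falls midway between the two class means, at 3.25.<br />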
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
'''Theorem'''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes, <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course. ''Note: these are the known y values in the training data.'' <br />
<br />
These known data are summarized in the following tables:<br />
<br />
:[[File:裁剪.jpg]]<br />
{{Cleanup|date=September 2010|reason=this graph is not complete, the reason is that it should be in consistent with the computation below.}}<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
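<br />
The same calculation can be sketched in code for any number of classes. This is a minimal sketch: the function names are hypothetical, and the likelihood values 0.05 (for ''pass'') and 0.2 (for ''fail'') are taken from the calculation above, with the assignment of each value to its class read off the numerator of that calculation.<br />

```python
def posteriors(likelihoods, priors):
    """Return P(Y=y | X=x) for every class y using the Bayes formula."""
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())           # P(X=x), the denominator
    return {y: joint[y] / evidence for y in joint}

def bayes_classify(likelihoods, priors):
    """Assign the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)

# Assumed inputs: P(X=(0,1,0) | Y=y) for y = 1 (pass) and y = 0 (fail).
likelihoods = {1: 0.05, 0: 0.2}
priors = {1: 0.5, 0: 0.5}

post = posteriors(likelihoods, priors)
label = bayes_classify(likelihoods, priors)
print(post, label)
```

With these inputs the rule selects class 0, in agreement with the prediction that the new student would fail.<br />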
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
<br />
The Bayesian view of probability states that any event E has a [http://en.wikipedia.org/wiki/Prior_probability prior probability] that represents how strongly one believes that event E will occur before knowing anything about any other event whose occurrence could affect event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability of event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed or be estimated from a sample of available data. Once a guessed or estimated value of the prior probability has been obtained, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate as new information about events relevant to event E becomes available. The more useful information of this kind is available, the more accurate the estimate of the probability of event E's occurrence becomes. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
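<br />
The long-run relative frequency can be illustrated with a small simulation. This is only a sketch under assumed settings: the event probability 0.3, the trial counts and the seed are arbitrary choices, and <code>relative_frequency</code> is a hypothetical helper name.<br />

```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials independent trials of an event with probability p
    and return the observed relative frequency n_x / n_t."""
    rng = random.Random(seed)
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p)
    return n_x / n_trials

# The approximation n_x / n_t tends to settle near p as n_t grows.
for n_t in (10, 1000, 100000):
    print(n_t, relative_frequency(0.3, n_t))
```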
<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a boundary (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the boundary the input lies. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior probabilities and the class-conditional densities are not known. Under LDA, one gets around this problem by assuming that both classes have [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distributions] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> to <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of <math>\,D(h^*)</math> is as follows:<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out alike terms and factoring).<br />
<br />
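It is easy to see that, under LDA, the Bayes classifier's decision boundary <math>\,D(h^*)</math> has the form <math>\,a^\top x+b=0</math>, with <math>\,a=\Sigma^{-1}(\mu_1-\mu_0)</math> and <math>\,b=\log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left(\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0\right)</math>, and it is linear in <math>\,x</math>. This is where the word ''linear'' in linear discriminant analysis comes from.<br />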
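<br />
As a rough illustration, the coefficients of the linear boundary can be read off the last line of the derivation and computed numerically. The sketch below is restricted to two dimensions so that the matrix inverse can be written out by hand; the helper names and the example means, covariance and priors are all hypothetical.<br />

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the LDA boundary is a^T x + b = 0,
    with a^T x + b > 0 on the side assigned to class 1."""
    sigma_inv = inv2(sigma)
    a = matvec(sigma_inv, [m1 - m0 for m1, m0 in zip(mu1, mu0)])
    b = (math.log(pi1 / pi0)
         - 0.5 * (dot(mu1, matvec(sigma_inv, mu1))
                  - dot(mu0, matvec(sigma_inv, mu0))))
    return a, b

# Hypothetical example: identity covariance and equal priors.
a, b = lda_boundary([2.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
print(a, b)   # boundary 2*x1 - 2 = 0, i.e. the vertical line x1 = 1
```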
<br />
<br />
LDA in the two-class case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. Doing so, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1}<br />
\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0</math> . In addition, if the two classes <math>\,m </math> and <math>\,n</math> contain the same number of data points, so that their estimated priors are equal, then the resulting decision boundary passes through the point exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.<br />
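<br />
The midpoint property can be checked directly. The following is a short worked step, assuming that the two priors are equal, <math>\,\pi_m=\pi_n</math> (which is what equal class sizes give when the priors are estimated by class proportions). Substituting <math>\,x=\frac{1}{2}(\mu_m+\mu_n)</math> into the last term of the boundary equation gives<br />
<br />
:<math>\begin{align}<br />
2x^\top\Sigma^{-1}(\mu_m-\mu_n) &= (\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n) \\<br />
&= \mu_m^\top\Sigma^{-1}\mu_m - \mu_n^\top\Sigma^{-1}\mu_n<br />
\end{align}</math><br />
<br />
where the cross terms cancel because <math>\,\Sigma^{-1}</math> is symmetric. The bracketed expression in the boundary equation is therefore zero at this point, and since <math>\,\log(\frac{\pi_m}{\pi_n})=0</math> when the priors are equal, the midpoint of the two centers satisfies the boundary equation.<br />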
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one shown at [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
According to [http://www.lsv.uni-saarland.de/Vorlesung/Digital_Signal_Processing/Summer06/dsp06_chap9.pdf this link], some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that the data in each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010''' ==<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix <math>\, \Sigma</math>.<br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix <math>\, \Sigma</math>. Instead, QDA makes the assumption that each class <math>\, k</math> has its own covariance matrix <math>\, \Sigma_k</math>.<br />
<br />
The derivation of the Bayes classifier's decision boundary <math>\,D(h^*)</math> under QDA is similar to that under LDA. Again, let us first consider the two-classes case where <math>\, \mathcal{Y}=\{0, 1\}</math>. This derivation is given as follows: <br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0</math> (by cancellation)<br />
:<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)</math> (by taking the log of both sides)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0</math> (by expanding out)<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0</math> <br />
<br />
It is easy to see that, under QDA, the decision boundary <math>\,D(h^*)</math> has the form <math>\,x^\top a x+b^\top x+c=0</math>, where <math>\,a</math> is a matrix, <math>\,b</math> is a vector and <math>\,c</math> is a scalar, and it is quadratic in <math>\,x</math>. This is where the word ''quadratic'' in quadratic discriminant analysis comes from.<br />
<br />
As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary <math>\,D(h^*)</math> between classes <math>\,m </math> and <math>\,n</math> to be <math>\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0</math>.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
* In the case of LDA, which assumes that a common covariance matrix is shared by all classes, <math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is linear in <math>\,x</math>.<br />
<br />
* In the case of QDA, which assumes that each class has its own covariance matrix, <math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math>, and the Bayes classifier's decision boundary <math>\,D(h^*)</math> is quadratic in <math>\,x</math>.<br />
<br />
<br />
'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of k for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
[http://www.stat.cmu.edu/~larry/=stat707/notes10.pdf See Theorem 46.6 Page 133]<br />
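<br />
The two discriminant functions in the theorem can be sketched directly in code. This is a minimal two-dimensional illustration rather than a definitive implementation: the helper names and the two example classes (their means, covariance matrices and priors) are hypothetical, and the LDA function assumes that the same covariance matrix is passed for every class.<br />

```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def bilinear(u, m, v):
    """Compute u^T m v for 2-dimensional u, v and a 2x2 matrix m."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

def delta_lda(x, mu, sigma, pi):
    """LDA discriminant: x^T S^-1 mu - 1/2 mu^T S^-1 mu + log(pi)."""
    s_inv = inv2(sigma)
    return bilinear(x, s_inv, mu) - 0.5 * bilinear(mu, s_inv, mu) + math.log(pi)

def delta_qda(x, mu, sigma, pi):
    """QDA discriminant: -1/2 log|S_k| - 1/2 (x-mu)^T S_k^-1 (x-mu) + log(pi)."""
    s_inv = inv2(sigma)
    diff = [x[0] - mu[0], x[1] - mu[1]]
    return (-0.5 * math.log(det2(sigma))
            - 0.5 * bilinear(diff, s_inv, diff)
            + math.log(pi))

def classify(x, classes, delta):
    """Return the class label k that maximizes delta_k(x)."""
    return max(classes, key=lambda k: delta(x, *classes[k]))

# Hypothetical classes: label -> (mean, covariance, prior).
classes = {
    1: ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    2: ([3.0, 3.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
}
print(classify([0.5, 0.2], classes, delta_lda),
      classify([2.5, 2.9], classes, delta_qda))
```

Because the two example classes share one covariance matrix, both discriminants give the same labels here; they can differ once the covariance matrices differ.<br />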
<br />
===In practice===<br />
In practice, the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their maximum likelihood estimates from the sample in their place, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When a common covariance matrix is needed, as in LDA, it is estimated as the weighted average of the covariance estimates of the individual classes:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{n} </math><br />
<br />
where <math>\,n_r</math> is the number of data points in class r, <math>\,\hat{\Sigma}_r</math> is the estimated covariance of class r and <math>\,n</math> is the total number of data points.<br />
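<br />
These estimates can be computed with a few loops. The sketch below is an illustration under stated assumptions: the function names and the six training points are hypothetical, and each class covariance is divided by <math>\,n_k</math>, which is the maximum likelihood normalization and an assumption about the intended estimator.<br />

```python
def estimate_parameters(xs, ys):
    """Per-class size, prior, mean and covariance (covariance divided by n_k)."""
    n = len(xs)
    d = len(xs[0])
    params = {}
    for k in sorted(set(ys)):
        pts = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(pts)
        mu = [sum(p[j] for p in pts) / n_k for j in range(d)]
        sigma = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in pts) / n_k
                  for j in range(d)] for i in range(d)]
        params[k] = {"n": n_k, "pi": n_k / n, "mu": mu, "sigma": sigma}
    return params

def pooled_covariance(params):
    """Common covariance: average of class covariances weighted by class size."""
    n = sum(p["n"] for p in params.values())
    d = len(next(iter(params.values()))["mu"])
    return [[sum(p["n"] * p["sigma"][i][j] for p in params.values()) / n
             for j in range(d)] for i in range(d)]

# Hypothetical training data: four points in class 0, two in class 1.
xs = [(0, 0), (2, 0), (0, 2), (2, 2), (4, 4), (6, 4)]
ys = [0, 0, 0, 0, 1, 1]
params = estimate_parameters(xs, ys)
pooled = pooled_covariance(params)
print(params[0]["mu"], params[1]["pi"], pooled)
```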
<br />
===Computation For QDA And LDA===<br />
<br />
First, let us consider QDA, and examine each of the following two cases.<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
<math>\, \Sigma_k = I </math> for every class <math>\,k</math> implies that our data is spherical. This means that the data of each class <math>\,k</math> is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,-\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class for which half the squared distance minus <math>\,\log(\pi_k)</math> is smallest maximizes <math>\,\delta_k</math>; when all priors are equal, this is simply the class with the nearest center. According to the theorem, we then assign the point to that class. <br />
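<br />
For the spherical case, the rule amounts to a nearest-center search adjusted by the log of the prior. The following sketch uses hypothetical class labels, centers and priors to show how an unequal prior can change the assignment of a point that lies slightly closer to the other center.<br />

```python
import math

def classify_spherical(x, centers, priors):
    """Maximize delta_k(x) = -1/2 ||x - mu_k||^2 + log(pi_k) over the classes."""
    def delta(k):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, centers[k]))
        return -0.5 * sq_dist + math.log(priors[k])
    return max(centers, key=delta)

# Hypothetical centers; the point x is slightly closer to center "b".
centers = {"a": [0.0, 0.0], "b": [2.0, 0.0]}
x = [1.1, 0.0]

equal = classify_spherical(x, centers, {"a": 0.5, "b": 0.5})
skewed = classify_spherical(x, centers, {"a": 0.9, "b": 0.1})
print(equal, skewed)
```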
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top </math> (In general, when a matrix <math>\,A</math> has the singular value decomposition <math>\,A=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,AA^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,A^\top A</math>. <br />
So if <math>\,A</math> is symmetric and positive semi-definite, we can take <math>\,U=V</math>. Here <math>\, \Sigma_k </math> is symmetric and positive semi-definite because it is a covariance matrix, namely that of the data <math>\,X_k</math> of class <math>\,k</math>) and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top </math> (since <math>\,U_k</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^\top(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S_k^{-\frac{1}{2}}U_k^\top x </math> and <math>\, S_k^{-\frac{1}{2}}U_k^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x </math>.<br />
<br />
A similar transformation of all the centers can be done from <math>\,\mu_k</math> to <math>\,\mu_k^*</math> where <math> \, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k </math>.<br />
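<br />
The identity derived above can be checked numerically. The sketch below is a two-dimensional illustration with hypothetical helper names and example values: it eigen-decomposes a symmetric positive definite covariance matrix in closed form, applies the transformation <math>\,S_k^{-\frac{1}{2}}U_k^\top</math> to a point and to the center, and compares the resulting squared Euclidean distance with the quadratic form computed directly from the inverse covariance.<br />

```python
import math

def eig_sym2(m):
    """Eigenvalues and unit eigenvectors of a symmetric 2x2 matrix.
    Returns (lam, vecs), where vecs[i] is the eigenvector for lam[i]."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    lam = [mean + radius, mean - radius]
    if b == 0:
        # Already diagonal: eigenvectors are the coordinate axes.
        vecs = [[1.0, 0.0], [0.0, 1.0]] if a >= c else [[0.0, 1.0], [1.0, 0.0]]
    else:
        vecs = []
        for l in lam:
            v = [b, l - a]                     # solves (m - l*I) v = 0
            norm = math.hypot(v[0], v[1])
            vecs.append([v[0] / norm, v[1] / norm])
    return lam, vecs

def whiten(x, lam, vecs):
    """Apply S^(-1/2) U^T to x (requires strictly positive eigenvalues)."""
    return [sum(vi * xi for vi, xi in zip(v, x)) / math.sqrt(l)
            for l, v in zip(lam, vecs)]

def quadratic_form(x, mu, sigma):
    """(x - mu)^T sigma^-1 (x - mu), computed from the explicit 2x2 inverse."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return (d * dx[0] ** 2 - (b + c) * dx[0] * dx[1] + a * dx[1] ** 2) / det

# Hypothetical covariance, center and data point.
sigma = [[2.0, 1.0], [1.0, 2.0]]
mu = [1.0, 1.0]
x = [3.0, 0.0]

lam, vecs = eig_sym2(sigma)
x_star = whiten(x, lam, vecs)
mu_star = whiten(mu, lam, vecs)
euclid_sq = sum((p - q) ** 2 for p, q in zip(x_star, mu_star))
print(euclid_sq, quadratic_form(x, mu, sigma))
```

The two printed values should agree up to rounding, which is the content of the derivation above.<br />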
<br />
Classification can now be carried out with <math>\,x^*</math> and <math>\,\mu_k^*</math> in much the same way as in Case 1 above, because the transformed data of class <math>\,k</math> have the identity matrix as their covariance. The one difference is that the term <math>\,-\frac{1}{2}\log(|\Sigma_k|)</math> no longer vanishes in general, so it must be computed for each class and included in <math>\,\delta_k</math>.<br />
<br />
Note that the transformation depends on the class through <math>\,U_k</math> and <math>\,S_k</math>. If all classes share the same covariance matrix, as in LDA, a single transformation applies to every class, so each data point needs to be transformed only once.<br />
<br />
If the classes have different shapes, that is, different covariance matrices <math>\,\Sigma_k</math>, can a single transformation still be used to map all data points <math>\,x</math> to <math>\,x^*</math>? The answer is no. Transforming a new data point with the transformation of class A alone would amount to assuming in advance that the point belongs to class A.<br />
<br />
The method nevertheless still works for QDA, provided that each class is handled with its own transformation. Given a data point and, say, two classes with different shapes, we transform the point once with the transformation of each class, compute <math>\,\delta_1</math> and <math>\,\delta_2</math> (each including its own <math>\,\log(|\Sigma_k|)</math> term), and assign the point to the class with the larger value.<br />
<br />
In summary, to apply QDA on a data set <math>\,X</math>, in the general case where <math>\, \Sigma_k \ne I </math> for each class <math>\,k</math>, one can proceed as follows:<br />
<br />
:: Step 1: For each class <math>\,k</math>, estimate the covariance matrix <math>\,\Sigma_k</math> from the data <math>\,X_k</math> of that class, and apply singular value decomposition on <math>\,\Sigma_k</math> to obtain <math>\,S_k</math> and <math>\,U_k</math>.<br />
<br />
:: Step 2: For each data point <math>\,x</math> and each class <math>\,k</math>, transform <math>\,x</math> to <math>\,x_k^* = S_k^{-\frac{1}{2}}U_k^\top x</math>, and transform the center <math>\,\mu_k</math> to <math>\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math> and each class <math>\,k</math>, find the squared Euclidean distance between <math>\,x_k^*</math> and <math>\,\mu_k^*</math>, compute <math>\,\delta_k(x) = -\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}\|x_k^*-\mu_k^*\|^2 + \log(\pi_k)</math>, and assign <math>\,x</math> to the class for which <math>\,\delta_k(x)</math> is greatest. When all classes have the same prior and the same determinant <math>\,|\Sigma_k|</math>, this reduces to choosing the class with the least squared Euclidean distance.<br />
<br />
<br />
Now, let us consider LDA. <br />
Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common covariance matrix across the classes, so we perform the singular value decomposition once, as opposed to <math>\,k</math> times.<br />
<br />
To apply LDA on a data set <math>\,X</math>, one can proceed as follows:<br />
<br />
:: Step 1: Estimate the common covariance matrix <math>\,\Sigma</math> and apply singular value decomposition on <math>\,\Sigma</math> to obtain <math>\,S</math> and <math>\,U</math>.<br />
<br />
:: Step 2: For each <math>\,x \in X</math>, transform <math>\,x</math> to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, and transform each center <math>\,\mu_k</math> to <math>\,\mu_k^* = S^{-\frac{1}{2}}U^\top \mu_k</math>.<br />
<br />
:: Step 3: For each data point <math>\,x \in X</math>, find the squared Euclidean distance between the transformed data point <math>\,x^*</math> and the transformed center <math>\,\mu_k^*</math> of each class, and assign <math>\,x</math> to the class for which <math>\,\frac{1}{2}\|x^*-\mu_k^*\|^2 - \log(\pi_k)</math> is the least over all of the classes. When all priors are equal, this is simply the class whose transformed center is nearest.<br />
<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA can provide a better classifier than LDA when the class covariance matrices differ and there is enough data to estimate them, because QDA does not assume that the covariance matrix of each class is identical, as LDA does. However, QDA still assumes that the class-conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method which does not rely on the Gaussian distribution assumption.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
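<br />
The two counts can be tabulated for a few dimensions with a short sketch. The function names are hypothetical; the formulas are the ones stated above.<br />

```python
def lda_param_count(K, d):
    """(K-1) linear differences a^T x + b, each with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_param_count(K, d):
    """(K-1) quadratic differences, each with d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

# Parameters per pairwise difference (K = 2) as the dimension grows.
for d in (2, 10, 100):
    print(d, lda_param_count(2, d), qda_param_count(2, d))
```

The integer division is exact because <math>\,d(d+3)</math> is always even.<br />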
<br />
== Trick: Using LDA to do QDA - September 28, 2010==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that contain nonlinear functions of our original features, such as their squares. We then do LDA on our new higher-dimensional data. The linear boundary found by LDA in the higher-dimensional space then corresponds to a quadratic boundary in the original space.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> distinct entries, the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is taken here to be a diagonal <math>d \times d</math> matrix with diagonal entries <math>v_1,\dots,v_d</math>, that we cannot estimate directly with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix containing the original entries squared, with cubic features by appending the original entries cubed, or even with a different function altogether, such as <math>\,\sin(x)</math>. Note, however, that this is not the same as performing QDA. Applying QDA directly to the same problem generally yields a different decision boundary. Here we look for a nonlinear boundary that may separate the classes better by using LDA in the expanded feature space; we can call this nonlinear LDA.<br />
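<br />
The equivalence between the quadratic function and the linear function on the expanded features can be checked on a small example. This is a sketch with hypothetical names and made-up coefficients, assuming, as in the vectors above, that only the squared terms <math>\,v_1,\dots,v_d</math> are present and there are no cross terms.<br />

```python
def expand_quadratic(x):
    """Map x = [x_1, ..., x_d] to x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]

def linear(w, x):
    """Evaluate the linear function w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical coefficients and input.
w = [1.0, -2.0]          # linear part
v = [0.5, 3.0]           # coefficients of the squared terms
x = [2.0, 1.0]

# Quadratic function evaluated in the original space ...
quadratic_value = sum(vi * xi ** 2 for vi, xi in zip(v, x)) + linear(w, x)
# ... equals a linear function evaluated in the expanded space.
expanded_value = linear(w + v, expand_quadratic(x))
print(quadratic_value, expanded_value)
```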
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This extends our sample with two more dimensions that hold the squares of the original two coordinates.<br />
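<br />
The same augmented matrix can also be built without loops. The sketch below uses a small made-up matrix in place of <code>sample</code> (an assumption for illustration only), so that it runs without the 2_3 data:<br />
<br />
 % Hypothetical stand-in for the n-by-2 matrix "sample" used above<br />
 sample = [1 2; -3 0.5; 0 -4];<br />
 % Append the element-wise squares as two extra columns (n-by-4 result)<br />
 X_star = [sample, sample.^2];<br />
<br />
With the real data, the second line alone would replace the preallocation and the double loop.<br />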
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA to the same data set and reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the classes of the points 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main differences with QDA are a slightly different call to <code>classify</code> and a more complicated procedure to plot the boundary, which is now a curve.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA that do not lie on the correct side of the line produced by LDA.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
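<br />
For intuition about what the coefficients of a linear boundary mean, the two-class LDA rule can be sketched with base Matlab functions only. This is an illustrative sketch on a small made-up data set with equal priors assumed; it is not a description of how <code>classify</code> is implemented internally, and all variable names are hypothetical.<br />
<br />
 X1 = [0 0; 1 0; 0 1; 1 1];           % made-up class 1 points (rows are observations)<br />
 X2 = [3 3; 4 3; 3 4; 4 4];           % made-up class 2 points<br />
 n1 = size(X1,1);  n2 = size(X2,1);<br />
 mu1 = mean(X1)';  mu2 = mean(X2)';   % class means as column vectors<br />
 % Pooled covariance: LDA assumes both classes share one covariance matrix<br />
 Sp = ((n1-1)*cov(X1) + (n2-1)*cov(X2)) / (n1+n2-2);<br />
 w = Sp \ (mu1 - mu2);                % linear coefficients of the boundary<br />
 b = -0.5 * (mu1 + mu2)' * w;         % constant term (equal priors assumed)<br />
 g1 = X1*w + b;                       % scores of the class 1 points (positive means class 1)<br />
 g2 = X2*w + b;                       % scores of the class 2 points (negative means class 2)<br />
<br />
The boundary is the set of points where <code>b + w'*x = 0</code>, which has the same <code>0 = a + bx + cy</code> form that was plotted with <code>ezplot</code> above.<br />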
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] that performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the details of this function. Here we examine the code of <code>princomp()</code> to see how it differs from the SVD method. The code of princomp follows, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in SCORE,<br />
 % the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x.<br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should note the following points when comparing with the SVD method:<br />
<br />
First, the rows of <math>\,X</math> correspond to observations and the columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we therefore take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when the centered data is decomposed as <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients of the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA with the SVD method and with princomp, and obtains the same results from both.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
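Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (possibly up to the sign of individual columns, since singular vectors are only determined up to sign).<br />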
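<br />
The key steps of the princomp listing can also be checked with base Matlab functions alone, which is useful when <code>princomp</code> is not available. The sketch below uses a small made-up data matrix (an assumption for illustration) and confirms that the squared singular values of the centered, scaled data equal the eigenvalues of the sample covariance matrix.<br />
<br />
 X = [2 1 0; 4 3 1; 6 2 0; 8 6 1];           % made-up data: 4 observations (rows), 3 variables<br />
 [m, n] = size(X);<br />
 centerx = X - repmat(mean(X), m, 1);        % subtract the column means<br />
 [U, D, pc] = svd(centerx ./ sqrt(m-1), 0);  % economy-size SVD, as in the listing<br />
 score = centerx * pc;                       % data expressed in the principal component basis<br />
 latent = diag(D).^2;                        % variances along the principal components<br />
 C = cov(X);  C = (C + C') / 2;              % sample covariance, symmetrized for eig<br />
 ev = sort(eig(C), 'descend');               % its eigenvalues in decreasing order<br />
<br />
Here <code>latent</code> and <code>ev</code> agree to rounding error, and the covariance matrix of <code>score</code> is diagonal with <code>latent</code> on its diagonal.<br />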
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Principal Component Analysis - September 30, 2010==<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br />
<br /><br />
Principal component analysis (PCA) is a dimensionality-reduction method invented by [http://en.wikipedia.org/wiki/Karl_Pearson Karl Pearson] in 1901 [http://stat.smmu.edu.cn/history/pearson1901.pdf]. Depending on where this methodology is applied, other common names of PCA include the [http://en.wikipedia.org/wiki/Karhunen%E2%80%93Lo%C3%A8ve_theorem Karhunen–Loève transform (KLT)], the [http://en.wikipedia.org/wiki/Harold_Hotelling Hotelling transform], and the proper orthogonal decomposition (POD). PCA is the simplest [http://en.wikipedia.org/wiki/Eigenvector eigenvector]-based [http://en.wikipedia.org/wiki/Multivariate_analysis multivariate analysis]. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA projects the data onto a user-defined number of its most important directions of variation (dimensions or '''principal components''') so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of our data while capturing as much of the variation in our data as possible.<br />
<br />
<br />
Furthermore, if one considers the lower dimensional representation produced by PCA as a least squares fit of our original data, then it can also be easily shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know which dimensions PCA selects to be the most informative dimensions in order to create the lower-dimensional representation. <br />
<br />
<br />
Suppose <math>\,X</math> is our <math>\,d \times n</math> data matrix, whose <math>\,n</math> columns are <math>\,d</math>-dimensional data points. The idea behind PCA is to apply [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] to <math>\,X</math> and to replace the <math>\,d</math> coordinates of each data point by a smaller set of coordinates that captures as much of the [http://en.wikipedia.org/wiki/Variance variance] in <math>\,X</math> as possible. First, through the application of singular value decomposition to <math>\,X</math>, PCA obtains all of our data's directions of variation. These directions are ordered from left to right, with the leftmost directions capturing the most variation in our data and the rightmost directions capturing the least. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space.<br />
<br />
<br />
By applying singular value decomposition to the <math>\,d \times n</math> matrix <math>\,X</math>, <math>\,X</math> is decomposed as <math>\,X = U\Sigma V^T \,</math>. The <math>\,d</math> columns of <math>\,U</math> are the [http://en.wikipedia.org/wiki/Eigenvector eigenvectors] of <math>\,XX^T \,</math>.<br />
The <math>\,n</math> columns of <math>\,V</math> are the eigenvectors of <math>\,X^TX \,</math>. The diagonal values of <math>\,\Sigma</math> are the square roots of the [http://en.wikipedia.org/wiki/Eigenvalue eigenvalues] of <math>\,XX^T \,</math> (<math>\,X^TX \,</math> has the same nonzero eigenvalues), and the <math>\,i</math>th diagonal value corresponds to the <math>\,i</math>th column of <math>\,U</math> and of <math>\,V</math>.<br />
<br />
<br />
We are interested in <math>\,U</math>, whose <math>\,d</math> columns are the <math>\,d</math> directions of variation of our data. Ordered from left to right, the <math>\,i</math>th column of <math>\,U</math> is the <math>\,i</math>th most informative direction of variation of our data; that is, it is the <math>\,i</math>th most effective column in terms of capturing the total variance exhibited by our data. PCA reduces the dimensionality of <math>\,X</math> by projecting <math>\,X</math> onto a subset of the columns of <math>\,U</math>. In practice, when we apply PCA to <math>\,X</math> to reduce its dimensionality from <math>\,d</math> to <math>\,k</math>, where <math>k < d\,</math>, we proceed as follows:<br />
<br />
:: Step 1: Center <math>\,X</math> by subtracting the mean data point from every column, so that the data have zero mean.<br />
<br />
:: Step 2: Apply singular value decomposition to <math>\,X</math> to obtain <math>\,U</math>.<br />
<br />
:: Step 3: Suppose we denote the resulting <math>\,k</math>-dimensional representation of <math>\,X</math> by <math>\,Y</math>. Then, <math>\,Y</math> is obtained as <math>\,Y = U_k^TX</math>. Here, <math>\,U_k</math> consists of the first (leftmost) <math>\,k</math> columns of <math>\,U</math> that correspond to the <math>\,k</math> largest diagonal elements of <math>\,\Sigma</math>.<br />
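<br />
The three steps can be sketched in a few lines of base Matlab. The data matrix below is made up for illustration (an assumption, not one of the course data sets), with one data point per column as in the text:<br />
<br />
 X = [2 4 6 8; 1 3 2 6; 0 1 0 1];            % made-up d-by-n data: d = 3, n = 4<br />
 k = 2;                                       % target dimensionality<br />
 Xc = X - repmat(mean(X,2), 1, size(X,2));    % Step 1: subtract the mean data point<br />
 [U, S, V] = svd(Xc);                         % Step 2: columns of U are the directions of variation<br />
 Uk = U(:,1:k);                               % first (leftmost) k columns of U<br />
 Y = Uk' * Xc;                                % Step 3: k-by-n representation<br />
<br />
The rows of <code>Y</code> have zero mean and are mutually uncorrelated, and the first row has the largest variance.<br />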
<br />
<br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:Handwritten 3s.gif]]<br />
<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,130}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[Image:23plotPCA.jpg]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\begin{align}\textbf{w}\end{align}</math> be an arbitrary direction, <math>\begin{align}\textbf{x}\end{align}</math> a data point and <math>\begin{align}\displaystyle u\end{align}</math> the length of the projection of <math>\begin{align}\textbf{x}\end{align}</math> in direction <math>\begin{align}\textbf{w}\end{align}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\begin{align}\textbf{w}\end{align}</math> is the same as <math>\begin{align}c\textbf{w}\end{align}</math>, for any nonzero scalar <math>c</math>, so without loss of generality, we assume that: <br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}.<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Treating <math>x_1, \ldots, x_D</math> as random variables with covariance matrix <math>\Sigma</math>, our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}. <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix (also written <math>S</math> below),<br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} .</math><br />
<br /><br /><br />
The above is the variance of <math>\begin{align}\displaystyle u \end{align}</math> formed by the weight vector <math>\begin{align}\textbf{w} \end{align}</math>. The first principal component is the vector <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>\begin{align}\textbf{w} \end{align}</math> that maximizes the function. Without a constraint this problem has no solution: the objective is a convex quadratic function of <math>\begin{align}\textbf{w} \end{align}</math>, so it can be made arbitrarily large simply by scaling <math>\begin{align}\textbf{w} \end{align}</math>. Since we are only interested in the direction of greatest variability, we restrict <math>\begin{align}\textbf{w} \end{align}</math> to unit length, and the problem becomes<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
s.t. <math>\textbf{w}^T \textbf{w} = 1.</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|.<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded on the unit sphere, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is a maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0  </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To find out which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one gives the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
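<br />
As a numerical sanity check (a brute-force sketch, not part of the Lagrange method itself), we can evaluate <math>\displaystyle f</math> on a fine grid of points on the unit circle and confirm that the largest value occurs where the gradient condition <math>\displaystyle \nabla f = \lambda \nabla g</math> holds. The grid size is an arbitrary choice.<br />
<br />
 theta = linspace(0, 2*pi, 100001);       % parametrize the constraint circle<br />
 x = cos(theta);  y = sin(theta);<br />
 f = x - y;                               % objective evaluated on the constraint<br />
 [fmax, idx] = max(f);                    % brute-force constrained maximum<br />
 xs = x(idx);  ys = y(idx);               % location of the maximum<br />
 lambda = 1 / (2*xs);                     % multiplier from the first component of grad f = lambda * grad g<br />
 gap = [1; -1] - lambda * [2*xs; 2*ys];   % residual of the gradient condition, close to zero<br />
<br />
The grid maximum is close to <math>\displaystyle \sqrt{2}</math> and is attained near <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, in agreement with the analytical solution.<br />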
<br />
====Determining '''W''' ====<br />
Returning to the original problem, we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} = 1 </math> and the second term of the Lagrangian is 0, so <math>\displaystyle L(\textbf{w},\lambda)</math> equals the variance <math> \textbf{w}^T S \textbf{w} </math> that we want to maximize.<br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''' with symmetric <math>S</math>, hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
<br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
To maximize the objective function we therefore choose the eigenvector corresponding to the largest eigenvalue. This is the first PC, '''u<sub>1</sub>''', which has the maximum variance<br /> (i.e. it captures as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components account for successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of the variances along all principal directions equals the total variance of the data.<br />
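<br />
These properties can be checked numerically. The following sketch uses a small made-up data matrix (an assumption for illustration) with one variable per row, and base Matlab functions only:<br />
<br />
 X = [2 4 6 8 5; 1 3 2 6 4; 0 1 0 1 3];     % made-up data: D = 3 variables (rows), n = 5 points<br />
 S = cov(X');  S = (S + S') / 2;             % sample covariance matrix (symmetrized for eig)<br />
 [W, L] = eig(S);<br />
 [lambda, order] = sort(diag(L), 'descend'); % eigenvalues in decreasing order<br />
 W = W(:, order);                            % matching eigenvectors<br />
 w1 = W(:,1);                                % first principal component<br />
 var_u1 = var(w1' * X);                      % sample variance of the projections onto w1<br />
 total_var = sum(var(X, 0, 2));              % sum of the variances of the D variables<br />
<br />
Here <code>var_u1</code> equals the largest eigenvalue <code>lambda(1)</code>, and <code>sum(lambda)</code>, <code>trace(S)</code> and <code>total_var</code> all agree to rounding error.<br />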
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                  % loads the data matrix X (one image per column)<br />
 who<br />
 size(X)<br />
 imagesc(reshape(X(:,1),20,28)')             % display the first noisy image<br />
 colormap gray<br />
 m_X=mean(X,2);                              % mean image<br />
 mm=repmat(m_X,1,300);                       % X has 300 columns<br />
 XX=X-mm;                                    % center the data<br />
 [u s v] = svd(XX);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';   % use ten principal components<br />
 xHat=xHat+mm;                               % add the mean back<br />
 figure<br />
 imagesc(reshape(xHat(:,1),20,28)')          % reconstruction of the first image; any column index from 1 to 300 can be used<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first shows the noisy data, in which we can barely make out the face.<br />
<br />
The second is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
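<br />
The last point can be illustrated on a small made-up matrix (an assumption for illustration, not the noisy face data): the error of the reconstruction from <code>k</code> components shrinks as <code>k</code> grows and reaches zero when every component is kept, i.e. the reconstruction then reproduces the original data, noise included.<br />
<br />
 X = [2 4 6 8 5; 1 3 2 6 4; 0 1 0 1 3; 5 5 4 6 5];   % made-up 4-by-5 data, one point per column<br />
 mm = repmat(mean(X,2), 1, size(X,2));<br />
 [u, s, v] = svd(X - mm);<br />
 r = min(size(X));                                    % number of available components<br />
 err = zeros(r,1);<br />
 for k = 1:r<br />
     xHat = u(:,1:k)*s(1:k,1:k)*v(:,1:k)' + mm;       % reconstruction from k components<br />
     err(k) = norm(X - xHat, 'fro');                  % reconstruction error<br />
 end<br />
<br />
The vector <code>err</code> is non-increasing and its last entry is zero up to rounding error.<br />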
<br />
====Application of PCA - Feature Extraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this: we can classify the data (i.e. use labeled training data to assign each data point to a known class) or cluster the data (i.e. group unlabeled data points by similarity, without using class labels).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8×8 picture, we can use all 64 pixels). However, this is hard, and it is easier to work with the reduced data, i.e. the features extracted from the data.<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized as follows (taken from the Lecture Slides).<br />
<br />
====Algorithm ====<br />
'''Recover basis:''' Calculate <math> XX^T =\sum_{i=1}^{n} x_i x_{i}^{T} </math> and let <math> U=</math> eigenvectors of <math> X X^T </math> corresponding to the top <math> d </math> eigenvalues.<br />
<br />
'''Encoding training data:''' Let <math>Y=U^TX </math> where <math>Y</math> is a <math>d \times n</math> matrix of encoding of the original data.<br />
<br />
'''Reconstructing training data:''' <math>\hat{X}= UY=UU^TX </math>.<br />
<br />
'''Encode test example:''' <math> y=U^T x </math> where <math> y </math> is a <math>d</math>-dimensional encoding of <math>x</math>.<br />
<br />
'''Reconstruct test example:''' <math>\hat{x}= Uy=UU^Tx </math>.<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, U_dY_{d \times n} </math>, where <math>\,U_d</math> contains the first (leftmost) <math>\,d</math> columns of <math>\,U</math>.<br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
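
The steps of the algorithm above can be sketched in a few lines of code. The following is a minimal illustration in Python rather than the lecture's implementation: it is restricted to 2-dimensional data and <math>d=1</math> so that the top eigenvector can be written in closed form, and the function names (<code>pca_fit</code>, <code>encode</code>, <code>reconstruct</code>) and the toy data are assumptions made for this example.

```python
import math

def pca_fit(points):
    """Return (mean, u) for 2-D data: the sample mean and the unit
    eigenvector of the centred scatter matrix with the largest eigenvalue."""
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    # Entries of the 2x2 scatter matrix sum_i x_i x_i^T of the centred data.
    a = sum((p[0] - mean[0]) ** 2 for p in points)
    b = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points)
    c = sum((p[1] - mean[1]) ** 2 for p in points)
    # For a symmetric 2x2 matrix [[a, b], [b, c]] the top eigenvector
    # makes the angle 0.5 * atan2(2b, a - c) with the first axis.
    theta = 0.5 * math.atan2(2 * b, a - c)
    return mean, (math.cos(theta), math.sin(theta))

def encode(x, mean, u):
    """Encoding y = u^T (x - mean): a single number when d = 1."""
    return u[0] * (x[0] - mean[0]) + u[1] * (x[1] - mean[1])

def reconstruct(y, mean, u):
    """Reconstruction x_hat = u y + mean."""
    return (mean[0] + u[0] * y, mean[1] + u[1] * y)

# Toy training data lying exactly on a line with direction (3, 4).
train = [(1 + 3 * t, 2 + 4 * t) for t in (-2, -1, 0, 1, 2)]
mean, u = pca_fit(train)

# A test point displaced from the mean orthogonally to that line.
x_test = (1 - 4, 2 + 3)
y_test = encode(x_test, mean, u)
x_hat = reconstruct(y_test, mean, u)
print("direction:", u)
print("test encoding:", y_test, "reconstruction:", x_hat)
```

In this sketch the training points are reconstructed exactly because their intrinsic dimensionality is 1, while the displacement of the test point orthogonal to the principal direction (the analogue of noise) is discarded by the reconstruction.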
<br />
== Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010 ==<br />
<br />
===Sir Ronald A. Fisher===<br />
Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir [http://en.wikipedia.org/wiki/Ronald_A._Fisher Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here]. <br />
In this paper, Fisher used the term ''discriminant function'' for the first time. The term ''discriminant analysis'' was introduced later by Fisher himself in a subsequent paper, which can be found [http://digital.library.adelaide.edu.au/coll/special//fisher/155.pdf here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
As in PCA, the goal of FDA is to project the data onto a lower-dimensional space. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found [http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf here]. PCA is an unsupervised method for dimensionality reduction, so it does not take into account the labels in the data. Suppose we have two clusters with different labels that are elongated, parallel to each other, and very close to each other. In this case, most of the total variation of the data is along the direction in which the two clusters extend. If we use PCA in a case like this, both clusters are projected onto the direction of greatest variation and merge into what looks like a single cluster. PCA would therefore mix up two clusters that, in fact, have different labels. What we need to do instead, in cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation, i.e. onto the direction of least variation of the data. On the 1-dimensional space resulting from such a projection, we would then be able to classify the data effectively, because the two clusters would be perfectly or nearly perfectly separated according to their labels. This is exactly the idea behind FDA. <br />
<br />
The main difference between FDA and PCA is that in FDA we are not interested in retaining as much of the variance of our original data as possible. Rather, our goal is to find a direction that is useful for classifying the data (i.e. a direction that is most representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
Suppose we have 2-dimensional data. FDA would then attempt to project the data of each class onto a single point in such a way that the resulting two points are as far apart from each other as possible. Intuitively, this is the ideal outcome for separating a pair of classes along a single direction. <br />
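
The contrast between the two methods can be illustrated with a small numerical sketch. The Python code below is only an illustration under stated assumptions: the toy data (two parallel, elongated clusters) and all function names are made up for this example, and the FDA direction is computed with the closed-form two-class solution <math>s_w^{-1}(\mu_1 - \mu_2)</math> that is derived later in this section.

```python
import math

def mean2(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def cov2(pts):
    """Covariance of 2-D points, returned as (a, b, c) for [[a, b], [b, c]]."""
    m, n = mean2(pts), len(pts)
    a = sum((p[0] - m[0]) ** 2 for p in pts) / n
    b = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts) / n
    c = sum((p[1] - m[1]) ** 2 for p in pts) / n
    return a, b, c

def top_direction(a, b, c):
    """Unit eigenvector of [[a, b], [b, c]] with the largest eigenvalue."""
    theta = 0.5 * math.atan2(2 * b, a - c)
    return (math.cos(theta), math.sin(theta))

def fda_direction(class1, class2):
    """Unit vector along s_w^{-1} (mu_1 - mu_2), with s_w = Sigma_1 + Sigma_2."""
    a1, b1, c1 = cov2(class1)
    a2, b2, c2 = cov2(class2)
    a, b, c = a1 + a2, b1 + b2, c1 + c2
    m1, m2 = mean2(class1), mean2(class2)
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    det = a * c - b * b  # must be non-zero for s_w to be invertible
    w = ((c * dx - b * dy) / det, (a * dy - b * dx) / det)
    norm = math.hypot(w[0], w[1])
    return (w[0] / norm, w[1] / norm)

def project(pts, w):
    return [w[0] * p[0] + w[1] * p[1] for p in pts]

# Two long, thin, parallel clusters that differ only in their vertical offset.
class1 = [(float(t), 0.5 + 0.1 * (1 if t % 2 else -1)) for t in range(-5, 6)]
class2 = [(float(t), -0.5 + 0.1 * (1 if t % 2 else -1)) for t in range(-5, 6)]

pca_dir = top_direction(*cov2(class1 + class2))  # labels are ignored
fda_dir = fda_direction(class1, class2)          # labels are used

pca_1, pca_2 = project(class1, pca_dir), project(class2, pca_dir)
fda_1, fda_2 = project(class1, fda_dir), project(class2, fda_dir)
print("PCA direction:", pca_dir)
print("FDA direction:", fda_dir)
```

In this toy setting the PCA direction runs along the clusters, so the two projected classes overlap completely, whereas the FDA direction is essentially orthogonal to it and the projected classes do not overlap.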
<br />
{{Cleanup|date=October 2010|reason= Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).}}<br />
{{Cleanup|date=October 2010|reason=If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks}}<br />
{{Cleanup|date=October 2010|reason=An extension of clustering is subspace clustering, in which different subspaces are searched through to find the relevant and appropriate dimensions. Points in high-dimensional data sets are roughly equidistant from each other, so feature selection methods are used to remove the irrelevant dimensions. These techniques do not keep the relative distance, so PCA is not useful for these applications. It should be noted that subspace clustering algorithms localize their search, unlike feature selection algorithms. For more information click here[http://portal.acm.org/citation.cfm?id=1007731]}}<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a two-class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
In general, for a k-class problem, we want to reduce the data to k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
{{Cleanup|date=October2010|reason=Anyone please add an example to make the comparison clearer}}<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs as well as or better than some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
===FDA Goals===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
==== Example in R ====<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
 >> library(MASS)   # provides mvrnorm and lda
 >> X = matrix(nrow=400,ncol=2)
 >> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))
 >> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))
 >> Y = c(rep("red",200),rep("blue",200))
: Draw two samples of 200 points each from bivariate normal distributions with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right)</math> and common covariance <math>\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class each point belongs to.<br />
<br />
 >> s <- svd(scale(X,scale=FALSE),nu=1,nv=1)
: Calculate the singular value decomposition of the centred data (PCA assumes zero-mean data, so the column means are subtracted first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
 >> s2 <- lda(X,grouping=Y)
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
<br />
FDA projects the data into lower dimensional space, where the distances between the projected means are maximum and the within-class variances are minimum. There are two categories of classification problems:<br />
<br />
1. Two-class problem<br />
<br />
2. Multi-class problem (addressed next lecture)<br />
<br />
=== Two-class problem ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and the two clouds may differ in size. To separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the size of each class, which is represented by its covariance.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within each class''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances (the sum of the two covariance matrices is itself a valid covariance matrix, since it is symmetric and positive semi-definite). <br />
<br />
{{Cleanup|date=October 2010|reason=In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?}}<br />
<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline{x_1}, \underline{x_2}, \cdots, \underline{x_n} \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> where <math>\ z_i </math> is a scalar<br />
<br />
====1. Minimizing within-class variance==== <br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_1\underline{w}) </math><br />
<br />
<math>\displaystyle \min_w (\underline{w}^T\Sigma_2\underline{w}) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math><br />
<br> (where <math>\,\Sigma_1</math> and <math>\,\Sigma_2 </math> are the covariance matrices of the 1st and 2nd classes of data, respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math>.<br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br /> <br />
<math>\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2, </math><br />
<br /><br /><br />
<math>\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\<br />
&= (\mu_1 - \mu_2)^T \underline{w} \underline{w}^T (\mu_1 - \mu_2) \\<br />
<br />
&= ((\mu_1 - \mu_2)^T \underline{w})^{T} (\underline{w}^T (\mu_1 - \mu_2))^{T} \\<br />
&= \underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \underline{w} \end{align}</math><br /><br />
<br />
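Note that each of the two factors is a scalar, and a scalar equals its own transpose. This is why both factors can be transposed in the second-to-last line and why their order can be exchanged to obtain the last line.<br />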
<br />
Let <math>\displaystyle s_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math>, the between-class covariance, then the goal is to <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math>.<br />
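
As a small worked check of this identity (the numbers are chosen only for illustration, reusing the class means <math>\mu_1 = (1,1)^T</math> and <math>\mu_2 = (5,3)^T</math> from the examples in this section), we have <math>\mu_1 - \mu_2 = (-4,-2)^T</math> and

:<math>s_B = \left( \begin{array}{c} -4 \\ -2 \end{array} \right) \left( \begin{array}{cc} -4 & -2 \end{array} \right) = \left( \begin{array}{cc} 16 & 8 \\ 8 & 4 \end{array} \right).</math>

For <math>\underline{w} = (1,0)^T</math> the projected means are 1 and 5, so the squared distance between them is <math>(1-5)^2 = 16</math>, which agrees with <math>\underline{w}^Ts_B\underline{w} = 16</math>. For <math>\underline{w} = (1,-2)^T</math> both means are projected to <math>-1</math>, so the squared distance is 0, which agrees with <math>\underline{w}^Ts_B\underline{w} = 16 - 32 + 16 = 0</math>. The second direction is orthogonal to <math>\mu_1 - \mu_2</math> and therefore cannot separate the two means at all.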
<br />
===The Objective Function for FDA===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min_w (\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})</math> or <math>\displaystyle \min_w (\underline{w}^Ts_w\underline{w})</math><br />
# <math>\displaystyle \max_w (\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w}) </math> or <math>\displaystyle \max_w (\underline{w}^Ts_B\underline{w})</math> <br />
<br /><br /><br />
So, we construct our objective function as maximizing the ratio of the two goals stated above:<br /><br />
<br /><br />
<math>\displaystyle \max_w \frac{(\underline{w}^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w})} {(\underline{w}^T(\Sigma_1 + \Sigma_2)\underline{w})} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max_w \frac{(\underline{w}^Ts_B\underline{w})}{(\underline{w}^Ts_w\underline{w})}</math> <br /><br />
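One could instead combine the two goals by subtraction, but the difference requires an additional scaling factor that weights one goal against the other. Using the ratio avoids having to choose such a factor.<br />
<br />
The ratio does not change when <math>\underline{w}</math> is rescaled, so its maximizer is not unique, and the numerator <math>\underline{w}^Ts_B\underline{w}</math> on its own is a convex function of <math>\underline{w}</math> that has no maximum. To get around this problem, we fix the scale of <math>\underline{w}</math> by adding the constraint <math>\underline{w}^Ts_w\underline{w} = 1</math> and maximize <math>\underline{w}^Ts_B\underline{w}</math> subject to it. To solve this optimization problem, we form the Lagrangian:<br />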
<br />
<br /><br /><br />
<math>\displaystyle L(\underline{w},\lambda) = \underline{w}^Ts_B\underline{w} - \lambda (\underline{w}^Ts_w\underline{w} -1)</math><br /><br /><br />
<br />
<br /><br />
Then, we set the partial derivative of L with respect to <math>\underline{w}</math> equal to zero:<br />
<math>\displaystyle \frac{\partial L}{\partial \underline{w}}=2s_B \underline{w} - 2\lambda s_w \underline{w} = 0 </math> <br /><br />
<br />
<math>s_B \underline{w} = \lambda s_w \underline{w}</math><br /><br />
<math>s_w^{-1}s_B \underline{w}= \lambda\underline{w}</math><br /><br /><br />
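This is a generalized eigenvalue problem. Since <math>\underline{w}^Ts_B\underline{w} = \lambda\underline{w}^Ts_w\underline{w} = \lambda</math> at any solution, the objective is maximized by taking <math> \underline{w}</math> to be the eigenvector of <math>s_w^{-1}s_B </math> that corresponds to the largest eigenvalue.<br /><br />
<br />
This solution can be further simplified as follows:<br /><br />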
<br />
<math>s_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\underline{w} = \lambda\underline{w} </math><br /><br />
<br />
Since <math>(\mu_1 - \mu_2)^T\underline{w}</math> is a scalar, <math>\underline{w} \propto s_w^{-1}(\mu_1 - \mu_2)</math>. <br /><br /><br />
This gives the direction of <math>\underline{w}</math> without doing an eigenvalue decomposition in the case of the 2-class problem.<br />
<br />
Note: In order for <math>{s_w}</math> to have an inverse, it must have full rank. A necessary (but not sufficient) condition for this is that the number of data points <math>\,\ge</math> the dimensionality of <math>\underline{x_{i}}</math>.<br />
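
The closed-form direction can be checked numerically. The following is a minimal sketch in Python, not code from the lecture: it uses the class means and the common covariance matrix from the examples in this section (<math>\mu_1 = (1,1)^T</math>, <math>\mu_2 = (5,3)^T</math>, <math>\Sigma_1 = \Sigma_2</math> with entries 1, 1.5, 1.5, 3), and the helper names (<code>matvec</code>, <code>inv2</code>, <code>fisher_ratio</code>) are assumptions made for the illustration.

```python
def matvec(m, v):
    """Product of a 2x2 matrix (nested tuples) with a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def inv2(m):
    """Inverse of a 2x2 matrix; fails when the matrix is not of full rank."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular (not full rank)")
    return ((m[1][1] / det, -m[0][1] / det), (-m[1][0] / det, m[0][0] / det))

def fisher_ratio(w, s_b, s_w):
    """Objective (w^T s_B w) / (w^T s_w w)."""
    return dot(w, matvec(s_b, w)) / dot(w, matvec(s_w, w))

mu1, mu2 = (1.0, 1.0), (5.0, 3.0)
sigma = ((1.0, 1.5), (1.5, 3.0))

# s_w = Sigma_1 + Sigma_2 = 2 * Sigma, and s_B = (mu1 - mu2)(mu1 - mu2)^T.
s_w = tuple(tuple(2 * x for x in row) for row in sigma)
diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])
s_b = ((diff[0] * diff[0], diff[0] * diff[1]),
       (diff[1] * diff[0], diff[1] * diff[1]))

# Closed-form direction w = s_w^{-1} (mu1 - mu2).
w = matvec(inv2(s_w), diff)

# Check the eigenvalue relation s_w^{-1} s_B w = lambda w,
# where lambda = (mu1 - mu2)^T w for this particular w.
lam = dot(diff, w)
eig_lhs = matvec(inv2(s_w), matvec(s_b, w))

print("w =", w)
print("lambda =", lam)
print("Fisher ratio at w =", fisher_ratio(w, s_b, s_w))
```

The sign and length of <math>\underline{w}</math> do not matter, since only its direction is used; using <math>\mu_2 - \mu_1</math> instead simply flips the sign.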
<br />
===FDA Using Matlab===<br />
Note: ''The following example was not actually mentioned in this lecture''<br />
<br />
We now see an application of the theory that we just introduced. Using Matlab, we find the principal components and the Fisher Discriminant Analysis projection of samples from two bivariate normal distributions to show the difference between the two methods. <br />
 %First of all, we generate the two data sets:
 % First data set X1
 X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);
 %In this case: 
 mu_1=[1;1]; 
 Sigma_1=[1 1.5; 1.5 3]; 
 %where mu_1 and Sigma_1 are the mean and covariance matrix.
 % Second data set X2
 X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300); 
 %Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]
plot(X1(:,1),X1(:,2),'.b'); hold on;<br />
plot(X2(:,1),X2(:,2),'ob')<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
 %We compute the principal components:
 % Combine the data sets so that both are mapped into the same subspace
 X=[X1;X2];        % 600 x 2, one observation per row
 % princomp expects one observation per row, so it is called before transposing
 % (princomp is deprecated in newer Matlab releases in favour of pca)
 [coefs, scores]=princomp(X);
 X=X';             % 2 x 600, one observation per column (used below)
 
 plot([0 coefs(1,1)], [0 coefs(2,1)],'b')
 plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')   % same direction, scaled for visibility
 sw=2*[1 1.5;1.5 3]   % sw=Sigma1+Sigma2=2*Sigma1
 w=sw\[4; 2]          % calculate s_w^{-1}(mu2 - mu1)
 plot ([0 w(1)], [0 w(2)],'g')
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
 %We now make the projection:
 Xf=w'*X
 figure
 plot(Xf(1:300),1,'ob') %The projected data are one-dimensional, so each class is drawn as points along a horizontal line
 hold on
 plot(Xf(301:600),1,'or')
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is very little overlap
 Xp=coefs(:,1)'*X
 figure
 plot(Xp(1:300),1,'ob')   % a marker is needed to make the discrete points visible
 hold on
 plot(Xp(301:600),2,'or') 
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]
<br />
===Some of FDA applications===<br />
There are many applications of FDA in many domains; some of them are listed below:<br />
<br />
* Speech/music/noise classification in hearing aids<br />
FDA can be used to enhance listening comprehension when the user goes from one sound environment to a different one. For more information, see the paper by Alexandre et al. [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/papers/1569101740.pdf here].<br />
<br />
* Application to face recognition<br />
FDA can be used for face recognition in different situations. Kong et al. propose a 2D FDA framework and apply it to face recognition with a small number of training samples [http://person.hst.aau.dk/pimuller/2D_FDA_Face_CVPR05fish.pdf here].<br />
<br />
* Palmprint Recognition<br />
FDA is used in biometrics, to implement an automated palmprint recognition system. See An Automated Palmprint Recognition System by Tee et al. [http://www.sciencedirect.com/science?_ob=MImg&_imagekey=B6V09-4FJ5XPN-1-1&_cdi=5641&_user=1067412&_pii=S0262885605000089&_origin=search&_coverDate=05%2F01%2F2005&_sk=999769994&view=c&wchp=dGLbVzz-zSkWb&md5=a064b67c9bdaaba7e06d800b6c9b209b&ie=/sdarticle.pdf here].<br />
<br />
{{Cleanup|date=October 2010|reason=I think briefing about the other applications would be easier than browsing through all of these applications}}<br />
<br />
{{Cleanup|date=October 2010|reason= This link is no longer valid.}}<br />
<br />
Other applications can be found in references 4 to 8, and more can be found [http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=1489148820&_sort=r&_st=13&view=c&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=f210273546a659c90ae0962fce7b8b4e&searchtype=a here]<br />
<br />
=== '''References'''===<br />
1. Kong, H.; Wang, L.; Teoh, E.K.; Wang, J.-G.; Venkateswarlu, R.; , "A framework of 2D Fisher discriminant analysis: application to face recognition with small number of training samples," Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , vol.2, no., pp. 1083- 1088 vol. 2, 20-25 June 2005<br />
doi: 10.1109/CVPR.2005.30<br />
[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1467563&isnumber=31473 1]<br />
<br />
2. Enrique Alexandre, Roberto Gil-Pita, Lucas Cuadra, Lorena Álvarez, Manuel Rosa-Zurera, "SPEECH/MUSIC/NOISE CLASSIFICATION IN HEARING AIDS USING A TWO-LAYER CLASSIFICATION SYSTEM WITH MSE LINEAR DISCRIMINANTS", 16th European Signal Processing Conference (EUSIPCO 2008), Lausanne, Switzerland, August 25-29, 2008, copyright by EURASIP, [http://www.eurasip.org/Proceedings/Eusipco/Eusipco2008/welcome.html 2]<br />
<br />
3. Connie, Tee; Jin, Andrew Teoh Beng; Ong, Michael Goh Kah; Ling, David Ngo Chek; "An automated palmprint recognition system", Journal of Image and Vision Computing, 2005. [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V09-4FJ5XPN-1&_user=1067412&_coverDate=05/01/2005&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1489147048&_rerunOrigin=google&_acct=C000051246&_version=1&_urlVersion=0&_userid=1067412&md5=a781a68c29fbf127473ae9baa5885fe7&searchtype=a 3]<br />
<br />
4. met, Francesca; Boqué, Ricard; Ferré, Joan; "Application of non-negative matrix factorization combined with Fisher's linear discriminant analysis for classification of olive oil excitation-emission fluorescence spectra", Journal of Chemometrics and Intelligent Laboratory Systems, 2006.<br />
[http://www.sciencedirect.com/science/article/B6TFP-4HR769Y-1/2/b5244d459265abb3a1bf5238132c737e 4]<br />
<br />
5. Chiang, Leo H.;Kotanchek, Mark E.;Kordon, Arthur K.; "Fault diagnosis based on Fisher discriminant analysis and support vector machines"<br />
Journal of Computers & Chemical Engineering, 2004<br />
[http://www.sciencedirect.com/science/article/B6TFT-4B4XPRS-1/2/bca7462236924d29ea23ec633a6eb236 5]<br />
<br />
6. Yang, Jian ;Frangi, Alejandro F.; Yang, Jing-yu; "A new kernel Fisher discriminant algorithm with application to face recognition", 2004<br />
[http://www.sciencedirect.com/science/article/B6V10-4997WS1-1/2/78f2d27c7d531a3f5faba2f6f4d12b45 6]<br />
<br />
7. Cawley, Gavin C.; Talbot, Nicola L. C.; "Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers", Journal of Pattern Recognition , 2003 [http://www.sciencedirect.com/science/article/B6V14-492718R-1/2/bd6e5d0495023a1db92ab7169cc96dde 7]<br />
<br />
8. Kodipaka, S.; Vemuri, B.C.; Rangarajan, A.; Leonard, C.M.; Schmallfuss, I.; Eisenschenk, S.; "Kernel Fisher discriminant for shape-based classification in epilepsy", Journal Medical Image Analysis, 2007. [http://www.sciencedirect.com/science/article/B6W6Y-4MH8BS0-1/2/055fb314828d785a5c3ca3a6bf3c24e9 8]<br />
<br />
==Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010==<br />
<br />
====Obtaining Covariance Matrices====<br />
<br />
<br />
The within-class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between-class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to compute directly. To bypass this problem we use the following method. We know that the total covariance <math>\,\mathbf{S}_{T}</math> of a given set of data is constant and known, and we can also decompose this variance into two parts: the within-class variance <math>\mathbf{S}_{W}</math> and the between-class variance <math>\mathbf{S}_{B}</math> in a way that is similar to [http://en.wikipedia.org/wiki/Analysis_of_variance ANOVA]. We thus have:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
where the total variance is given by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = <br />
\frac{1}{n}<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
We can now get <math>\mathbf{S}_{B}</math> from the relationship: <br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
<br />
Actually, there is another way to obtain <math>\mathbf{S}_{B}</math>. Suppose the data contains <math>\, k </math> classes, and each class <math>\, j </math> contains <math>\, n_{j} </math> data points. We denote the overall mean vector by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\frac{1}{n} \sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
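Ignoring the normalizing factors <math>\frac{1}{n}</math> and <math>\frac{1}{n_{i}}</math> (i.e. treating <math>\mathbf{S}_{T}</math> and <math>\mathbf{S}_{i}</math> as unnormalized scatter matrices), and noting that the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = 0</math> for every class <math>\,i</math>, we obtain<br />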
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within-class covariance <math>\mathbf{S}_{W}</math><br />
and the between-class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term in the final line of the derivation above as the between-class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
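
The decomposition can be verified numerically on a small data set. The following Python sketch is only an illustration under stated assumptions: the toy data and helper names are made up for this example, and unnormalized scatter matrices are used, as in the derivation above. It computes <math>\mathbf{S}_{B}</math> both directly from the class means and as <math>\mathbf{S}_{T} - \mathbf{S}_{W}</math>.

```python
def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def outer(d):
    """Outer product d d^T of a 2-vector, as a nested list."""
    return [[d[0] * d[0], d[0] * d[1]], [d[1] * d[0], d[1] * d[1]]]

def add(m1, m2, scale=1.0):
    """Return m1 + scale * m2 for 2x2 nested lists."""
    return [[m1[r][c] + scale * m2[r][c] for c in range(2)] for r in range(2)]

def scatter(pts, centre):
    """Unnormalized scatter matrix: sum of (x - centre)(x - centre)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in pts:
        s = add(s, outer((p[0] - centre[0], p[1] - centre[1])))
    return s

# Hypothetical toy data: k = 3 classes of unequal size in 2 dimensions.
classes = {
    1: [(0, 0), (1, 0), (0, 1), (1, 1)],
    2: [(4, 4), (5, 5), (6, 4)],
    3: [(0, 5), (1, 6)],
}
all_points = [p for pts in classes.values() for p in pts]
mu = mean(all_points)

s_t = scatter(all_points, mu)           # total scatter
s_w = [[0.0, 0.0], [0.0, 0.0]]          # within-class scatter
s_b = [[0.0, 0.0], [0.0, 0.0]]          # between-class scatter
for pts in classes.values():
    mu_i = mean(pts)
    s_w = add(s_w, scatter(pts, mu_i))
    s_b = add(s_b, outer((mu_i[0] - mu[0], mu_i[1] - mu[1])), scale=len(pts))

s_b_by_subtraction = add(s_t, s_w, scale=-1.0)
print("S_B (direct):   ", s_b)
print("S_B = S_T - S_W:", s_b_by_subtraction)
```

Both routes give the same matrix up to rounding error, so in practice either formula can be used to obtain the between-class matrix.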
<br />
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^* =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})^{T})<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
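Here <math>\mathbf{S}_{B}^*</math> is the two-class definition and <math>\mathbf{S}_{B}</math> is the general definition with <math>k=2</math>. The two differ only by a scalar factor: substituting <math>\mathbf{\mu} = (n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2})/n</math> with <math>n=n_{1}+n_{2}</math> gives <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B}^*</math>, so both lead to the same optimal direction.<br />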
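The relation between the two expressions above can be verified numerically. The sketch below assumes made-up class sizes and class means (<code>n1</code>, <code>n2</code>, <code>mu1</code>, <code>mu2</code> are illustrative values only) and checks that the general between-class matrix equals the two-class one scaled by <math>n_{1}n_{2}/n</math>.<br />

```python
# Check S_B = (n1 * n2 / n) * S_B^* for two classes, with made-up means and sizes.

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

n1, n2 = 3, 5                      # hypothetical class sizes
mu1 = [1.0, 2.0]                   # hypothetical class means
mu2 = [4.0, -1.0]
n = n1 + n2
mu = [(n1 * a + n2 * b) / n for a, b in zip(mu1, mu2)]   # overall mean

# Two-class definition: (mu1 - mu2)(mu1 - mu2)^T
d12 = [a - b for a, b in zip(mu1, mu2)]
S_B_star = outer(d12, d12)

# General definition with k = 2: sum_i n_i (mu_i - mu)(mu_i - mu)^T
d1 = [a - m for a, m in zip(mu1, mu)]
d2 = [b - m for b, m in zip(mu2, mu)]
S_B = [[n1 * p + n2 * q for p, q in zip(r1, r2)]
       for r1, r2 in zip(outer(d1, d1), outer(d2, d2))]

factor = n1 * n2 / n
max_gap = max(abs(S_B[r][c] - factor * S_B_star[r][c])
              for r in range(2) for c in range(2))
print(factor, max_gap)
```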
<br />
Next, we look for the optimal linear transformation of the data points:<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\quad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i}))((\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W})<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
As a measure of the between-class separation after the transformation, we use:<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the <math>k-1</math> largest<br />
eigenvalues of the eigenvalue problem given below.<br />
<br />
{{Cleanup|reason=What if we encounter complex eigenvalues? Then the concept of being large does not make sense. What is the solution in that case? }}<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
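As an illustration of the eigenvalue problem above, the following sketch finds the leading eigenvector of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> by power iteration for a two-class, two-dimensional case. The matrix <code>S_W</code>, the class sizes and the class means are made-up values, and power iteration is used here only because it needs no external library; it is not the method prescribed in the lecture. When <math>\mathbf{S}_{W}</math> is symmetric positive definite and <math>\mathbf{S}_{B}</math> is symmetric positive semidefinite, the eigenvalues of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> are real and nonnegative, so ordering them by size is meaningful.<br />

```python
# Leading eigenvector of S_W^{-1} S_B by power iteration (2-D, two classes).

def outer(u, v):
    return [[a * b for b in v] for a in u]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

# Hypothetical two-class summary statistics (k = 2, d = 2).
S_W = [[4.0, 1.0], [1.0, 3.0]]          # assumed symmetric positive definite
n1, n2 = 3, 5
mu1, mu2 = [1.0, 2.0], [4.0, -1.0]
n = n1 + n2
mu = [(n1 * a + n2 * b) / n for a, b in zip(mu1, mu2)]

d1 = [a - m for a, m in zip(mu1, mu)]
d2 = [b - m for b, m in zip(mu2, mu)]
S_B = [[n1 * p + n2 * q for p, q in zip(r1, r2)]
       for r1, r2 in zip(outer(d1, d1), outer(d2, d2))]

M = matmul(inv2(S_W), S_B)              # S_W^{-1} S_B

# Power iteration: repeatedly apply M and renormalize.
w = [1.0, 0.0]
for _ in range(50):
    w = normalize(matvec(M, w))

Mw = matvec(M, w)
lam = sum(a * b for a, b in zip(Mw, w))             # eigenvalue estimate (w has unit length)
residual = max(abs(a - lam * b) for a, b in zip(Mw, w))

# For two classes the direction should be parallel to S_W^{-1}(mu1 - mu2).
direct = matvec(inv2(S_W), [a - b for a, b in zip(mu1, mu2)])
cross = w[0] * direct[1] - w[1] * direct[0]         # zero when the vectors are parallel
print(lam, residual, cross)
```

With only two classes there is a single nonzero eigenvalue, so one direction is recovered; for <math>k</math> classes the same idea would be repeated for the <math>k-1</math> leading eigenvectors.<br />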
<br />
<br />
Recall that the squared Frobenius norm of <math>\mathbf{X}</math> is <br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2_{F} = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following classic criterion function that Fisher used<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, the normalization is imposed on each column <math>\mathbf{w}_{i}</math> of <math>\mathbf{W}</math> and one Lagrange multiplier <math>\lambda_{i}</math> is introduced per column. Collecting the multipliers in a <math>(k-1) \times (k-1)</math> diagonal matrix <math>\Lambda</math> gives:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - \mathbf{I} \right)\right]<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices. Setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \mathbf{S}_{W}\mathbf{W}\Lambda=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \mathbf{S}_{W}\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>. Read column by column, this is <math>\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{S}_{W}\mathbf{w}_{i}</math>.<br />
<br />
In fact, <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>: <math>\mathbf{S}_{B}</math> is a sum of <math>k</math> rank-one matrices built from the vectors <math>\mathbf{\mu}_{i}-\mathbf{\mu}</math>, which satisfy <math>\sum_{i} n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu}) = 0</math>.<br />
<br />
Therefore, the solution is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the <math>k-1</math> largest<br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
{{Cleanup|date=October 2010|reason=Adding more general comments about the advantages and flaws of FDA would be effective here.}}<br />
<br />
{{Cleanup|date=October 2010|reason=Would you please show how we could reconstruct our original data from the data whose dimensionality has been reduced by FDA.}}<br />
{{Cleanup|date=October 2010|reason= When you reduce the dimensionality of data in the most general form you lose some features of the data, and you cannot reconstruct the data from the reduced space unless the data have special features that help in reconstruction, like sparsity. In FDA it seems that we cannot reconstruct the data in general from the reduced version of the data }}<br />
<br />
===Generalization of Fisher's Linear Discriminant Analysis ===<br />
<br />
Fisher's Linear Discriminant Analysis (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity<br />
and the fact that it does not require strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be seriously harmed by only a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Therefore, there is a need for robust procedures that can accommodate outliers and are not strongly affected by them. A generalization of Fisher's linear discriminant algorithm [http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf] has been developed that leads easily to a very robust procedure.<br />
<br />
Also notice that LDA can be seen as a dimensionality reduction technique. In a general k-class problem, the k class means lie in an affine subspace of dimension at most k-1. Given a data point, we are looking for the class mean closest to this point. In LDA, we project the data point onto this subspace and calculate distances within it. If the dimensionality of the data, d, is much larger than the number of classes, k, then we have a considerable drop in dimension.<br />
<br />
==Linear and Logistic Regression - October 12, 2010==<br />
<br />
===Linear Regression===<br />
Linear regression is an approach for modeling a scalar response <math> y</math> from a set of explanatory (independent) variables <math>X</math>. The goal is to find a set of explanatory variables related to <math> y</math> and to estimate its value from them. In classification, by contrast, the goal is to assign data to groups such that members of the same group are more similar to each other than to members of other groups.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
Recall that, according to the Bayes classification rule, <br/><br />
<math>P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{l}f_{l}(x)\pi_{l}}</math><br/><br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the components<br />
<math>\,x_{1}, ..., x_{d}</math> of the input <math>\,\mathbf{x}</math>.<br />
<br />
The simple linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
y_i = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
and we can denote it as<br />
:<math><br />
\begin{align}<br />
\mathbf{y} = \beta^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
where <math>\,\beta^{T} = (<br />
\beta_1,..., \beta_{d},\beta_0)</math> is a <math>1 \times (d+1)</math> vector and <math>\mathbf{X}=<br />
\begin{pmatrix}<br />
\mathbf{x}_{1}, \dots,\mathbf{x}_{n}\\<br />
1, \dots, 1<br />
\end{pmatrix}<br />
</math> is a <math>(d+1) \times n</math> matrix. Here <math>\mathbf{x}_{i} </math> is a <math>d \times 1</math> vector and <math>\mathbf{y} = (y_{1}, \dots, y_{n})</math> is a <math>1 \times n</math> row vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math> our goal is to find <math>\,\beta^{T}</math> such that the linear model fits the data while minimizing sum of squared errors using the [http://en.wikipedia.org/wiki/Least_squares Least Squares method].<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
We then try to minimize the residual sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\beta^{T}\mathbf{X})(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}(\mathbf{y}-\beta^{T}\mathbf{X})^{T}<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}\mathbf{X}^{T}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}(\mathbf{y}^{T}-\mathbf{X}^{T}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}\mathbf{y}^{T}<br />
\end{align}<br />
</math><br />
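The closed-form solution above can be tried on a small example. The sketch below assumes one feature (<math>d=1</math>), made-up data <code>x</code> and <code>y</code>, and the convention of these notes in which <math>\mathbf{X}</math> is <math>(d+1)\times n</math> with a row of ones as its last row. The <math>2\times 2</math> system is solved with Cramer's rule to avoid any external library.<br />

```python
# Least squares with the (d+1) x n convention: beta_hat = (X X^T)^{-1} X y^T.

x = [0.0, 1.0, 2.0, 3.0, 4.0]          # made-up inputs
y = [1.1, 2.9, 5.2, 7.1, 8.8]          # made-up responses, roughly 2x + 1

# X has one row per feature plus a final row of ones for the intercept.
X = [x, [1.0] * len(x)]

# X X^T is 2 x 2 and X y^T is 2 x 1.
XXt = [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]
Xy = [sum(a * b for a, b in zip(r, y)) for r in X]

# Solve the 2 x 2 normal equations by Cramer's rule.
(a, b), (c, d) = XXt
det = a * d - b * c
beta = [(d * Xy[0] - b * Xy[1]) / det,     # slope
        (a * Xy[1] - c * Xy[0]) / det]     # intercept

# Fitted values and the normal equations X (y - y_hat)^T = 0.
y_hat = [beta[0] * xi + beta[1] for xi in x]
normal_eq = [sum(rj * (yi - yh) for rj, yi, yh in zip(r, y, y_hat)) for r in X]
print(beta, normal_eq)
```

The vector <code>normal_eq</code> being numerically zero is exactly the first-order condition obtained by setting the derivative of RSS to zero.<br />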
<br />
'''Note on notation:''' In this course the data matrix <math>\mathbf{X}</math> is <math>(d+1) \times n</math> (one column per observation) and the model is written <math>\mathbf{y} = \beta^{T}\mathbf{X}</math>, with <math>\mathbf{y}</math> a <math>1 \times n</math> row vector. Last year's notes, like many textbooks, use the transposed convention <math>\mathbf{y} = \mathbf{X}\beta</math>, in which <math>\mathbf{X}</math> is <math>n \times (d+1)</math>, <math>\beta</math> is <math>(d+1) \times 1</math> and <math>\mathbf{y}</math> is an <math>n \times 1</math> column vector; the solution then reads <math>\hat\beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}</math>. The two sets of formulas are transposes of each other, so the results agree once the definitions are matched.<br />
<br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \hat\beta^{T}\mathbf{X} = <br />
\mathbf{y}\mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}^{T}(\mathbf{X}\mathbf{X}^{T})^{-1}\mathbf{X} </math> is called the [http://en.wikipedia.org/wiki/Hat_matrix hat matrix].<br />
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{l}f_{l}(x)\pi_{l}}</math><br/><br />
To make sense as a probability, <math>\displaystyle r(x)</math> must be a value between 0 and 1. For two classes coded as <math>\ Y \in \{0,1\}</math> we have<br />
<math>\ E(Y|X=x) = 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x) = P(Y=1|X=x) </math>, <br />
so the regression function <math>\displaystyle r(x)=E(Y|X=x)</math> is exactly the posterior probability of class 1. Estimating it by regression is a more direct approach to classification, since it does not need to estimate <math>\ f_k(x) </math> and <math>\ \pi_k </math>. However, if <math>\displaystyle r(x)</math> is modelled as a linear function and <math>\mathbf{\hat\beta} </math> is learned as above, nothing restricts the fitted <math>\displaystyle r(x)</math> to values between 0 and 1.<br />
For this reason the model is not a good estimate of the posterior probability, although it can sometimes still lead to a decent classifier, for example with the target coding <math>\ y_i=\frac{1}{n_1} </math> for points in class 1 and <math>\ y_i=\frac{-1}{n_2} </math> for points in class 2.<br />
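The problem described in the note can be seen on a toy example. The sketch below fits a straight line to made-up 0/1 labels by ordinary least squares (using the usual one-feature slope and intercept formulas) and evaluates the fitted <math>r(x)</math> at the ends of the data range. The data and the function name <code>r_hat</code> are illustrative assumptions.<br />

```python
# Linear regression on 0/1 labels: the fitted "probability" can leave [0, 1].

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]     # made-up one-dimensional inputs
y = [0, 0, 0, 1, 1, 1]                 # class labels coded as 0 and 1

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares for a single feature plus intercept.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

def r_hat(x_new):
    """Fitted regression function, used as an estimate of P(Y=1 | X=x)."""
    return slope * x_new + intercept

print(r_hat(0.0), r_hat(2.5), r_hat(5.0))
```

Thresholding <code>r_hat</code> at 0.5 still separates the two groups in this example, which matches the remark that the resulting classifier can be decent even though the fitted values are not valid probabilities.<br />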
<br />
===Logistic Regression===<br />
The [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] model arises from the desire to model the posterior probabilities of the <math>\displaystyle K</math> classes via linear functions in <math>\displaystyle x</math>, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood <math>\displaystyle Pr(Y|X)</math>. Since <math>\displaystyle Pr(Y|X)</math> completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications with two classes, for instance: people survive or die, have a disease or not, have a risk factor or not.<br />
<br />
====Logistic Function====<br />
[[File:200px-Logistic-curve.svg.png | Logistic Sigmoid Function]]<br />
<br />
<br />
<br />
A [http://en.wikipedia.org/wiki/Logistic_function logistic function] or logistic curve is the most common sigmoid curve. <br />
<br />
1. <math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
2. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
3. <math>y(0) = \frac{1}{2}</math><br />
<br />
4. <math> \int y dx = \ln(1 + e^{x}) + C</math><br />
<br />
5. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
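The listed properties of the logistic function can be checked numerically. The sketch below compares the derivative identity <math>dy/dx = y(1-y)</math> with a central finite difference and compares the function with its fifth-order Taylor polynomial around 0, whose <math>x^{5}</math> coefficient is 1/480. The evaluation points and the step <code>h</code> are arbitrary choices.<br />

```python
import math

def logistic(x):
    """Logistic function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def taylor5(x):
    """Fifth-order Taylor polynomial of the logistic function around 0."""
    return 0.5 + x / 4 - x ** 3 / 48 + x ** 5 / 480

# Property 2: dy/dx = y (1 - y), checked with a central difference.
x0, h = 0.7, 1e-6
numeric_slope = (logistic(x0 + h) - logistic(x0 - h)) / (2 * h)
analytic_slope = logistic(x0) * (1.0 - logistic(x0))

# Property 5: the Taylor polynomial is accurate near 0.
taylor_gap = abs(logistic(0.3) - taylor5(0.3))

print(logistic(0.0), numeric_slope, analytic_slope, taylor_gap)
```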
<br />
====Intuition behind Logistic Regression====<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
====The Logistic Regression Model====<br />
<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
{{Cleanup|date=October 13 2010|reason=It could be useful to have sources for these graphs. We don't know if they are copyrighted}}<br />
{{Cleanup|date=October 18 2010|reason=I could not find any source for these graphs. However, they follow the definition of the given probability. I don't think the generated graph as shown here is copyrighted, but if you are worried you can draw the figure yourself by applying the function and post the result.}}<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
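The two class probabilities can be computed directly from the definitions. In the sketch below the parameter vector <code>beta</code> and the input <code>x</code> are made-up values, and the last component of <code>x</code> is assumed to be the constant 1 for the intercept. The code also confirms that the log odds recover <math>\underline{\beta}^T \underline{x}</math>.<br />

```python
import math

def p_class1(beta, x):
    """P(Y=1 | X=x) = exp(beta^T x) / (1 + exp(beta^T x))."""
    s = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(s) / (1.0 + math.exp(s))

def p_class0(beta, x):
    """P(Y=0 | X=x) = 1 / (1 + exp(beta^T x))."""
    s = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(s))

beta = [0.8, -0.5]      # hypothetical parameters: slope and intercept
x = [2.0, 1.0]          # one feature followed by the constant 1

p1 = p_class1(beta, x)
p0 = p_class0(beta, x)
log_odds = math.log(p1 / p0)       # should equal beta^T x = 0.8 * 2.0 - 0.5

print(p1, p0, p1 + p0, log_odds)
```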
<br />
====Fitting a Logistic Regression====<br />
Logistic regression tries to fit a distribution. The common practice in statistics is to fit a density function, here the posterior probability of each class <math>\displaystyle Pr(Y|X)</math>, to the data using [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]. The maximum likelihood estimate of <math>\underline\beta</math> is the value that maximizes the probability of the observed labels <math>\displaystyle{y_{1},...,y_{n}}</math> given the inputs <math>\displaystyle{x_{1},...,x_{n}}</math> under the model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n \left[y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\right]\\<br />
&=\displaystyle\sum_{i=1}^n \left[y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\right]<br />
\end{align}<br />
</math><br />
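The simplification above can be confirmed numerically by evaluating the log-likelihood both from its definition and from the final simplified form. The data <code>xs</code>, <code>ys</code> and the parameter vector <code>beta</code> in this sketch are made-up values; each input carries a trailing constant 1 for the intercept.<br />

```python
import math

def log_lik_definition(beta, xs, ys):
    """Sum of y*log(p) + (1-y)*log(1-p) with p = exp(s)/(1+exp(s))."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = sum(b * xi for b, xi in zip(beta, x))
        p = math.exp(s) / (1.0 + math.exp(s))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def log_lik_simplified(beta, xs, ys):
    """Sum of y*s - log(1 + exp(s)) with s = beta^T x."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = sum(b * xi for b, xi in zip(beta, x))
        total += y * s - math.log(1.0 + math.exp(s))
    return total

# Hypothetical data: one feature plus a constant component.
xs = [[0.5, 1.0], [1.5, 1.0], [2.0, 1.0], [3.0, 1.0]]
ys = [0, 1, 0, 1]
beta = [0.3, -0.2]

l_def = log_lik_definition(beta, xs, ys)
l_simple = log_lik_simplified(beta, xs, ys)
print(l_def, l_simple)
```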
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math> <br />
<br />
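Setting this derivative to zero gives <math>\ d+1 </math> nonlinear equations in <math>\ \underline{\beta} </math>, one per component of <math>\ \underline{\beta} </math>. Because every <math>\underline{x}_i</math> contains a constant component equal to 1 (the intercept term), the corresponding equation reads <math>\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta}) </math>, i.e. the expected number of class ones matches the observed number.<br />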
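The gradient formula can be sanity-checked against a finite-difference approximation of the log-likelihood. The sketch below assumes made-up data and an arbitrary parameter vector; <code>h</code> is the finite-difference step.<br />

```python
import math

def log_likelihood(beta, xs, ys):
    """l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        s = sum(b * xi for b, xi in zip(beta, x))
        total += y * s - math.log(1.0 + math.exp(s))
    return total

def gradient(beta, xs, ys):
    """dl/dbeta = sum_i (y_i - p(x_i; beta)) x_i."""
    grad = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        s = sum(b * xi for b, xi in zip(beta, x))
        p = math.exp(s) / (1.0 + math.exp(s))
        for c in range(len(beta)):
            grad[c] += (y - p) * x[c]
    return grad

# Hypothetical data: one feature plus a constant component.
xs = [[0.5, 1.0], [1.5, 1.0], [2.0, 1.0], [3.0, 1.0]]
ys = [0, 1, 0, 1]
beta = [0.3, -0.2]

# Central finite differences, one coordinate at a time.
h = 1e-6
numeric = []
for c in range(len(beta)):
    up = list(beta)
    down = list(beta)
    up[c] += h
    down[c] -= h
    numeric.append((log_likelihood(up, xs, ys) - log_likelihood(down, xs, ys)) / (2 * h))

analytic = gradient(beta, xs, ys)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
print(analytic, numeric, max_gap)
```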
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
====Extension====<br />
<br />
* When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a [http://en.wikipedia.org/wiki/Multinomial_logit Multinomial Logit model].<br />
<br />
* Limitations of Logistic Regression:<br />
:1. No assumptions are made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another because this could cause problems with estimation.<br />
:2. A large number of data points (i.e. a large sample size) is required for logistic regression to provide sufficient numbers in both classes. The more features/dimensions the data has, the larger the sample size required.<br />
<br />
==Lecture summary==<br />
{{Cleanup|date=October 18 2010|reason=Can anybody provide a better lecture summary? The one below is to just get it started}}<br />
In this lecture, linear regression was introduced and the class posterior probabilities for the two-class problem were defined through the logistic model. Maximum likelihood was used to estimate the model parameters (i.e. to fit the logistic model to the labelled data).<br />
<br />
== Logistic Regression Cont. - October 14, 2010 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Estimating Parameters <math>\underline{\beta}</math> ===<br />
<br />
'''Criterion''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
'''Newton-Raphson Algorithm:'''<br /><br />
<br />
If we want to find <math>\ x^* </math> such that <math>\ f(x^*)=0</math>,<br />
<br />
we first pick a starting point <math>\ x^{old}</math> and then iterate:<br />
<br /><br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f(x^{old})}{f'(x^{old})} </math> <br /><br />
<math> \ x^{old} \leftarrow x^{new}</math> <br />
<br /><br />
This is repeated until convergence. <br />
<br />
If we want to maximize or minimize <math>\ f(x) </math>, we instead solve <math>\ f'(x)=0 </math> by applying the same iteration to the derivative:<br />
<br />
<math>\ x^{new} \leftarrow x^{old}-\frac {f'(x^{old})}{f''(x^{old})} </math><br />
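The scalar iteration can be written in a few lines. The sketch below uses a hypothetical helper <code>newton_root</code> with an arbitrary tolerance and iteration cap. It first finds a root of <math>x^{2}-2</math>, and then finds the maximizer of <math>\log x - x</math> by applying the same iteration to its derivative <math>1/x - 1</math>, whose own derivative is <math>-1/x^{2}</math>.<br />

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson iteration x_new = x_old - f(x_old) / f'(x_old)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / df(x)
        if abs(x_new - x) < tol:       # stop when the update is negligible
            return x_new
        x = x_new
    return x

# Root finding: solve x^2 - 2 = 0 starting from x = 1.
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)

# Optimization: maximize g(x) = log(x) - x by solving g'(x) = 1/x - 1 = 0,
# using g''(x) = -1/x^2 as the derivative of g'.
stationary = newton_root(lambda x: 1.0 / x - 1.0, lambda x: -1.0 / x ** 2, 0.5)

print(root, stationary)
```

The starting point matters: Newton-Raphson is only guaranteed to converge when started sufficiently close to the solution, so the starting values here were chosen with that in mind.<br />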
<br />
<br /><br />
<br />
In vector notation the above can be written as <br /><br />
<br />
<math><br />
X^{new} \leftarrow X^{old} - H^{-1}\nabla<br />
</math><br />
<br /><br />
where <math>H</math> is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix] (the matrix of second derivatives) and <math>\nabla</math> is the gradient, both evaluated at <math>X^{old}</math>. <br />
<br /><br />
<br />
'''note:''' If the Hessian is not invertible the [http://en.wikipedia.org/wiki/Generalized_inverse generalized inverse] or pseudo inverse can be used<br />
<br /><br />
<br /><br />
<br />
<br />
As shown above, the [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second derivative, or Hessian.<br />
<br />
<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here]. The site includes a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the number of occurrences of <math>\underline{\beta}</math> to one, using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
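* Let <math>\,X</math> be the <math>{(d+1)}\times{n}</math> input matrix whose <math>i</math>th column is <math>\underline{x}_i</math> (including the constant 1 for the intercept), as in the linear regression section.<br />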
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
<math>\ Z </math> is called the adjusted response. The update is solved repeatedly, since <math>\ \underline{P} </math>, <math>\ W </math>, and <math>\ Z </math> change at every iteration. This algorithm is called [http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares iteratively reweighted least squares] because each step solves a weighted least squares problem.<br />
<br />
Recall that linear regression by least squares, with <math>\underline{y}</math> written as an <math>n\times 1</math> column vector, solves <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
and that the solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow arg \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
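The steps above can be sketched in a few lines of code. The following is a minimal illustration in Python (standard library only) rather than the MATLAB used elsewhere on this page; the helper names <code>solve</code>, <code>dot</code>, and <code>irls</code>, the tolerance, and the iteration cap are our own assumptions, and no safeguard is included for weights that become numerically zero.<br />

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def irls(xs, ys, tol=1e-8, max_iter=50):
    """Fit beta by iteratively reweighted least squares.

    xs: list of feature vectors x_i (the columns of X); ys: list of 0/1 labels.
    """
    d = len(xs[0])
    beta = [0.0] * d                                              # step 1
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-dot(beta, x))) for x in xs]   # step 3
        w = [pi * (1.0 - pi) for pi in p]                         # step 4
        z = [dot(beta, x) + (y - pi) / wi                         # step 5
             for x, y, pi, wi in zip(xs, ys, p, w)]
        # step 6: solve (X W X^T) beta = X W Z instead of forming the inverse
        A = [[sum(wi * x[j] * x[k] for x, wi in zip(xs, w)) for k in range(d)]
             for j in range(d)]
        rhs = [sum(wi * x[j] * zi for x, wi, zi in zip(xs, w, z)) for j in range(d)]
        new_beta = solve(A, rhs)
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:                                             # step 7
            break
    return beta
```

To include an intercept, a constant 1 can be prepended to every feature vector. At convergence the score equations <math>X(\underline{Y}-\underline{P})=0</math> should hold approximately, which gives a simple way to check a fit.<br />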
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the two-class case, <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear decision boundaries.<br />
:'''Note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>, so it is not guaranteed to fall between 0 and 1 or to sum to 1 over the classes. A closed-form least squares solution exists.<br />
<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to lie between 0 and 1 and to sum to 1 over the classes. No closed-form solution exists, so the parameters are found iteratively.<br />
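The first difference can be seen on a small example. The sketch below (in Python, with hypothetical toy data and our own function names <code>fit_line</code> and <code>logistic</code>) fits a least squares line to 0/1 labels and evaluates it away from the training points, then passes the same linear score through the logistic function.<br />

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x for a scalar feature x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

def logistic(t):
    """Map a real-valued score t to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical toy data: one feature, labels in {0, 1}.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_line(xs, ys)

linear_far_right = b0 + b1 * 10.0      # linear "probability" at x = 10, exceeds 1 here
linear_far_left = b0 + b1 * (-1.0)     # linear "probability" at x = -1, negative here
squashed = logistic(linear_far_right)  # the same score after the logistic function
```

The line is a perfectly reasonable least squares fit, yet its values cannot be read as probabilities outside a limited range of <math>\,x</math>.<br />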
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly with the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically with the dimension.<br />
#LDA estimates its parameters more efficiently because it uses more information about the data; samples without class labels can also be used in LDA.<br />
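The two parameter counts quoted above can be tabulated to see how quickly they diverge. The Python sketch below simply encodes the formulas stated on this page; the function names are our own, and reading the LDA count as two mean vectors, one shared symmetric covariance matrix, and two priors is our interpretation of the formula.<br />

```python
def logistic_param_count(d):
    """Number of adjustable parameters for logistic regression, as stated above."""
    return d

def lda_param_count(d):
    """Two class means (2d), one shared symmetric covariance (d(d+1)/2), two priors (2)."""
    return 2 * d + d * (d + 1) // 2 + 2

# Tabulate both counts for a few dimensions.
table = [(d, logistic_param_count(d), lda_param_count(d)) for d in (2, 10, 100)]
for d, n_logistic, n_lda in table:
    print(d, n_logistic, n_lda)
```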
<br />
{{Cleanup|date=October 2010|reason= Could somebody please validate the following points}} <br />
<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression model that classifies the data. This function returns B, which is a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \ge 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
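:The fitted rule above can be checked numerically. The sketch below is in Python rather than MATLAB; the coefficients are copied from the output of mnrfit shown above, while the function names and the two query points are hypothetical and chosen only for illustration.<br />

```python
import math

# Coefficients reported above: intercept, then the weights of X_1 and X_2.
B = (0.1861, -5.5917, -3.0547)

def linear_score(x1, x2):
    """Value of B(1) + B(2)*X_1 + B(3)*X_2 for one point."""
    return B[0] + B[1] * x1 + B[2] * x2

def posterior_class1(x1, x2):
    """P(Y=1 | X=x) = exp(t) / (1 + exp(t)) with t the linear score."""
    return 1.0 / (1.0 + math.exp(-linear_score(x1, x2)))

def classify_point(x1, x2):
    """Class 1 if the linear score is non-negative, class 2 otherwise."""
    return 1 if linear_score(x1, x2) >= 0 else 2

# Two hypothetical points in the space of the first two principal components.
label_at_origin = classify_point(0.0, 0.0)
label_at_one_one = classify_point(1.0, 1.0)
```

:The rule based on the sign of the linear score agrees with choosing the class whose posterior probability exceeds one half, since the two posteriors sum to 1.<br />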
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is the decision boundary found by logistic regression. The line shows how the two classes are split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in MATLAB.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
===Lecture Summary===<br />
<br />
Traditionally, logistic regression parameters are estimated using maximum likelihood. However, other optimization techniques may be used as well.<br />
<br /><br />
Since there is no closed-form solution for the zero of the first derivative of the log-likelihood, the Newton-Raphson algorithm is used. Since the problem is convex, any optimum that Newton's method converges to is a global optimum.<br />
<br /><br />
Logistic regression requires fewer parameters than LDA or QDA and is therefore more favorable for high-dimensional data.<br />
<br />
===Supplements===<br />
<br />
A detailed proof that logistic regression is convex is available [http://people.csail.mit.edu/jrennie/writing/convexLR.pdf here]. See '1 Binary LR' for the case we discussed in lecture.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6437stat841f102010-10-02T22:38:58Z<p>Hclam: /* FDA vs. PCA Example in Matlab */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> Y </math> from another random variable <math> X </math>, where <math> Y </math> represents the label assigned to a new data input and <math> X </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,ith</math> training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\, ith</math> training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
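The empirical error rate is straightforward to compute once a classification rule is available. The Python sketch below is only an illustration: the function name <code>empirical_error_rate</code>, the threshold rule <code>h</code>, and the toy training set are all hypothetical.<br />

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of training inputs (x_i, y_i) for which h(x_i) differs from y_i."""
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

def h(x):
    """A hypothetical rule on one feature: predict class 1 when x exceeds 2.5."""
    return 1 if x > 2.5 else 0

# Hypothetical training set with one feature and labels in {0, 1}.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
rate = empirical_error_rate(h, xs, ys)  # two of the six inputs are misclassified
```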
<br />
=== Bayes Classifier ===<br />
<br />
A widely used practical form of the Bayes classifier is the naive Bayes classifier, a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for its underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that, given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
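The Bayes formula above translates directly into code. The following Python sketch assumes that the likelihood of the observed <math>\,x</math> under each class and the prior of each class are already known; the function name <code>posteriors</code> and the three-class numbers are hypothetical.<br />

```python
def posteriors(likelihoods, priors):
    """Return P(Y=y | X=x) for every class y.

    likelihoods[y] is P(X=x | Y=y) for the observed x; priors[y] is P(Y=y).
    """
    joint = {y: likelihoods[y] * priors[y] for y in priors}  # numerator for each class
    evidence = sum(joint.values())                           # P(X=x), the denominator
    return {y: joint[y] / evidence for y in joint}

# Hypothetical three-class example.
likelihoods = {1: 0.10, 2: 0.30, 3: 0.05}
priors = {1: 0.5, 2: 0.3, 3: 0.2}
post = posteriors(likelihoods, priors)
most_probable = max(post, key=post.get)  # the class with the largest posterior
```

Because the evidence is the same for every class, it only rescales the numerators so that the posteriors sum to 1; it does not change which class is most probable.<br />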
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P(Y=1|X=x)</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it produces the least possible probability of misclassification for any given new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
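Approach 3 can be sketched for a single feature as follows. This Python illustration makes an assumption that is not part of the description above, namely that each class-conditional density is a one-dimensional Gaussian fitted by its sample mean and variance; the function names and the toy data are hypothetical as well.<br />

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution (an assumed model choice)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_density_classifier(xs, ys):
    """Estimate P(X=x|Y=0), P(X=x|Y=1) and P(Y=1), then return (r_hat, h)."""
    prior1 = sum(ys) / len(ys)                 # estimate of P(Y=1) as the mean of the Y_i
    params = {}
    for label in (0, 1):
        pts = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(pts) / len(pts)
        var = sum((p - mean) ** 2 for p in pts) / len(pts)
        params[label] = (mean, var)

    def r_hat(x):
        # Estimated posterior P(Y=1 | X=x) via the Bayes formula.
        f1 = gaussian_pdf(x, *params[1]) * prior1
        f0 = gaussian_pdf(x, *params[0]) * (1.0 - prior1)
        return f1 / (f0 + f1)

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Hypothetical training data: class 0 clusters near 1.75, class 1 near 4.75.
xs = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
r_hat, h = fit_density_classifier(xs, ys)
```

With equal priors and equal estimated variances, as in this toy data, the estimated posterior equals one half exactly midway between the two class means.<br />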
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
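The arithmetic of this example can be verified with a few lines of Python. The likelihood values 0.05 and 0.2 and the priors of 0.5 are the ones used in the calculation above; the variable names are our own.<br />

```python
# Values taken from the calculation above.
prior_pass = 0.5   # estimated P(Y=1)
prior_fail = 0.5   # estimated P(Y=0)
lik_pass = 0.05    # P(X=(0,1,0) | Y=1)
lik_fail = 0.2     # P(X=(0,1,0) | Y=0)

# Posterior probability of passing, by the Bayes formula.
r_hat = (lik_pass * prior_pass) / (lik_fail * prior_fail + lik_pass * prior_pass)

# Classification rule: predict "pass" (1) only if r_hat exceeds one half.
prediction = 1 if r_hat > 0.5 else 0
print(r_hat, prediction)
```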
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how strongly one believes event E will occur before knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work that laid the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out a large, if not infinite, number of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
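The long-run relative frequency described above can be illustrated with a small simulation. The Python sketch below is only an illustration under our own assumptions: the function name <code>relative_frequency</code>, the event probability of 0.3, the trial counts, and the fixed random seed are all hypothetical choices.<br />

```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials independent trials of an event with probability p
    and return n_x / n_t, the fraction of trials in which the event occurred."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p)
    return n_x / n_trials

# Relative frequencies for an increasing number of trials.
estimates = [(n_t, relative_frequency(0.3, n_t)) for n_t in (10, 1000, 100000)]
for n_t, estimate in estimates:
    print(n_t, estimate)
```

With few trials the ratio can be far from the underlying probability, while with many trials it typically settles close to it, which is the approximation <math>P(x) \approx \frac{n_x}{n_t}</math> used in practice.<br />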
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the dimension of <math>\,x</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where the vector ''a'' and the scalar ''b'' are constants. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
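<br />
As a rough numerical check of these coefficients, the sketch below evaluates ''a'' and ''b'' for the boundary written as <math>\,a^\top x+b=0</math>. All parameter values are hypothetical and chosen only for illustration; they are not taken from the course data.<br />
<br />
```matlab
% Hypothetical two-class parameters (illustrative values only).
mu1   = [1; 2];            % mean of class 1
mu0   = [-1; 0];           % mean of class 0
Sigma = [2 0.5; 0.5 1];    % covariance matrix shared by both classes
pi1   = 0.5;  pi0 = 0.5;   % prior probabilities

% Coefficients of the linear boundary a'*x + b = 0.
a = -2 * (Sigma \ (mu1 - mu0));
b = -2 * log(pi1 / pi0) + mu1' * (Sigma \ mu1) - mu0' * (Sigma \ mu0);

% With equal priors, the midpoint of the two means lies on the boundary.
x_mid = (mu1 + mu0) / 2;
boundary_value_at_mid = a' * x_mid + b;

% The sign of a'*x + b tells us the side: negative values favour class 1,
% positive values favour class 0.
side_of_mu1 = a' * mu1 + b;
side_of_mu0 = a' * mu0 + b;
```
<br />
The sign convention follows from the derivation: the left-hand side equals <math>\,-2</math> times the difference of the two log posteriors, so it is negative wherever class 1 is the more probable class.<br />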
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. One then obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. In the special case where classes <math>\,m </math> and <math>\,n</math> have the same number of data points, so that their estimated priors are equal and the log term vanishes, the resulting decision boundary passes through the point exactly halfway between the class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
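<br />
The halfway property can be verified by substituting the midpoint <math>\,x=\tfrac{1}{2}(\mu_m+\mu_n)</math> into the boundary equation, assuming equal priors <math>\,\pi_m=\pi_n</math> and using the symmetry of <math>\,\Sigma^{-1}</math>. A short sketch of the calculation:<br />
<br />
```latex
\begin{align}
-2\log\left(\frac{\pi_m}{\pi_n}\right) &= 0
  && (\pi_m = \pi_n) \\
2x^\top\Sigma^{-1}(\mu_m-\mu_n)
  &= (\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n)
  && \left(x = \tfrac{1}{2}(\mu_m+\mu_n)\right) \\
  &= \mu_m^\top\Sigma^{-1}\mu_m - \mu_n^\top\Sigma^{-1}\mu_n
  && (\mu_n^\top\Sigma^{-1}\mu_m = \mu_m^\top\Sigma^{-1}\mu_n) \\
\Rightarrow\quad
-2\log\left(\frac{\pi_m}{\pi_n}\right)
  + \mu_m^\top\Sigma^{-1}\mu_m - \mu_n^\top\Sigma^{-1}\mu_n
  - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) &= 0
\end{align}
```
<br />
So the midpoint of the two class means satisfies the boundary equation whenever the priors are equal; with unequal priors the log term shifts the boundary towards the mean of the less probable class.<br />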
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that calculating the class posteriors <math>\Pr(Y=k|X=x)</math> gives us the optimal classification. He also shows that, by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance matrices of the classes are identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If each class-conditional density <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are all the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
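<br />
The rule in the theorem can be sketched in a few lines of Matlab. The means, covariance matrices and priors below are hypothetical values chosen for illustration, and the helper name <code>delta</code> is our own.<br />
<br />
```matlab
% Quadratic discriminant delta_k(x) for a Gaussian class with mean mu,
% covariance Sigma and prior probability prior.
delta = @(x, mu, Sigma, prior) ...
    -0.5 * log(det(Sigma)) - 0.5 * (x - mu)' * (Sigma \ (x - mu)) + log(prior);

% Hypothetical parameters for two classes (illustrative values only).
mu    = {[0; 0], [3; 3]};
Sigma = {eye(2), [2 0; 0 0.5]};
prior = [0.5, 0.5];

x = [3; 3];                              % point to classify
scores = zeros(1, numel(mu));
for k = 1:numel(mu)
    scores(k) = delta(x, mu{k}, Sigma{k}, prior(k));
end
[~, h] = max(scores);                    % h(x) = argmax_k delta_k(x)
fprintf('x is assigned to class %d\n', h);
```
<br />
Because each class keeps its own covariance matrix here, this is the quadratic (QDA) form of the rule; passing the same <code>Sigma</code> for every class gives the same decisions as the linear form.<br />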
<br />
===In practice===<br />
Since the true parameters are unknown in practice, we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When all classes are assumed to share a common covariance matrix, it is estimated by the weighted average of the class sample covariances:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma_r}}{\sum_{r=1}^{k}n_r} </math><br />
<br />
This is the maximum likelihood estimate of the common covariance matrix.<br />
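<br />
The estimates above can be computed as follows. This is a minimal sketch on a made-up toy data set; the variable names are our own, and the covariances use the maximum likelihood normalisation (division by <math>\,n_k</math>) to match the formulas.<br />
<br />
```matlab
% Toy training set (hypothetical values): rows are observations.
X = [0 0; 2 0; 0 2; 2 2; 4 4; 6 4];
y = [1; 1; 1; 1; 2; 2];

classes = unique(y);
K = numel(classes);
[n, d] = size(X);

pi_hat       = zeros(K, 1);
mu_hat       = zeros(K, d);
Sigma_hat    = zeros(d, d, K);
Sigma_pooled = zeros(d, d);

for k = 1:K
    Xk  = X(y == classes(k), :);                 % observations of class k
    n_k = size(Xk, 1);
    pi_hat(k)    = n_k / n;                      % prior estimate n_k / n
    mu_hat(k, :) = mean(Xk, 1);                  % class mean
    Xc = bsxfun(@minus, Xk, mu_hat(k, :));       % centre the class data
    Sigma_hat(:, :, k) = (Xc' * Xc) / n_k;       % ML covariance of class k
    Sigma_pooled = Sigma_pooled + n_k * Sigma_hat(:, :, k);
end
Sigma_pooled = Sigma_pooled / n;                 % common covariance estimate
```
<br />
QDA would use the per-class matrices in <code>Sigma_hat</code>, while LDA would use the single matrix <code>Sigma_pooled</code>.<br />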
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can compute the squared distance between a point and each class center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. With equal priors, the class whose center is closest to the point maximises <math>\,\delta_k</math>; otherwise the log-prior term can shift the decision towards the more probable class. According to the theorem, we then assign the point to the class <math>\,k</math> with the largest <math>\,\delta_k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
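<br />
A small sketch of Case 1, assuming two hypothetical class centers and a single point to classify, shows both the nearest-center behaviour and the effect of the log-prior adjustment.<br />
<br />
```matlab
% Hypothetical class means (one per column) and a point to classify.
mu = [0 2;
      0 0];
x  = [1.1; 0];

% With Sigma_k = I: delta_k = -0.5 * ||x - mu_k||^2 + log(pi_k).
sq_dist = sum(bsxfun(@minus, mu, x).^2, 1);   % squared distance to each mean

prior_equal = [0.5 0.5];
[~, h_equal] = max(-0.5 * sq_dist + log(prior_equal));

prior_skewed = [0.9 0.1];
[~, h_skewed] = max(-0.5 * sq_dist + log(prior_skewed));

fprintf('equal priors: class %d, skewed priors: class %d\n', h_equal, h_skewed);
```
<br />
The point is slightly closer to the second center, so equal priors pick class 2, but a strong enough prior in favour of class 1 reverses the decision.<br />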
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. So if <math>\, X</math> is symmetric and positive semi-definite, as a covariance matrix is, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is such a matrix.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
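<br />
The transformation can be checked numerically. In the sketch below the covariance matrix, mean and data point are hypothetical; the check is that the quadratic form in the original space equals the squared Euclidean distance after the transformation.<br />
<br />
```matlab
% Hypothetical covariance matrix, class mean and data point.
Sigma = [4 1.2; 1.2 1];
mu    = [1; -1];
x     = [2.5; 0.5];

[U, S, ~] = svd(Sigma);                  % Sigma = U*S*U' (symmetric, positive definite)
W = diag(1 ./ sqrt(diag(S))) * U';       % W = S^(-1/2) * U'

x_star  = W * x;                         % transformed data point
mu_star = W * mu;                        % transformed class mean

d_original    = (x - mu)' * (Sigma \ (x - mu));   % (x-mu)' * inv(Sigma) * (x-mu)
d_transformed = sum((x_star - mu_star).^2);       % squared Euclidean distance
fprintf('%.6f vs %.6f\n', d_original, d_transformed);
```
<br />
In the transformed coordinates the covariance becomes the identity, since <code>W*Sigma*W'</code> equals <code>eye(2)</code> up to rounding, which is why Case 1 applies.<br />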
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider two classes with different shapes, and suppose we transform them to the same shape. Given a data point, we must decide which class it belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA can often fit the data better than LDA because it does not make LDA's assumption that the covariance matrix of every class is identical. However, QDA still assumes that each class-conditional distribution is Gaussian, which may not hold for real data. Another method, kernel QDA, does not rely on the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare the differences between one given class and the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
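<br />
The two formulas are easy to tabulate. The sketch below evaluates them for a two-class problem at a few example dimensions; the chosen values of <code>K</code> and <code>d</code> are only illustrative.<br />
<br />
```matlab
K = 2;                 % number of classes (example value)
d = [2 10 64];         % example data dimensions

n_lda = (K - 1) * (d + 1);                    % (K-1)(d+1)
n_qda = (K - 1) * (d .* (d + 3) / 2 + 1);     % (K-1)(d(d+3)/2 + 1)

for i = 1:numel(d)
    fprintf('d = %2d: LDA %4d parameters, QDA %5d parameters\n', ...
            d(i), n_lda(i), n_qda(i));
end
```
<br />
The LDA count grows linearly in <math>\,d</math> while the QDA count grows quadratically, which is the gap shown in the plot.<br />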
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are non-linear functions of our original features, such as their squares. We then do LDA on our new higher-dimensional data. The linear boundary found by LDA in the higher-dimensional space can then be expressed in terms of the original features, where it is quadratic.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free parameters, these extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
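where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. Here <math>\,v</math> is a <math>d \times d</math> matrix, taken to be diagonal with entries <math>v_1,\dots,v_d</math> in what follows.<br />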
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
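<br />
A short sketch of the construction for <math>\,d=2</math>, with made-up coefficients and assuming the quadratic term has no cross terms, shows that the quadratic function in the original space and the linear function in the augmented space give the same value.<br />
<br />
```matlab
% Hypothetical coefficients for d = 2 (illustrative values only).
w = [1; 2];              % linear coefficients w_1, w_2
v = [3; 4];              % coefficients v_1, v_2 of the squared terms
x = [2; -1];             % a data point

% Quadratic function in the original space (v used as a diagonal matrix).
g = x' * diag(v) * x + w' * x;

% The same value written as a linear function in the augmented space.
w_star = [w; v];
x_star = [x; x.^2];
g_star = w_star' * x_star;

% For a data matrix with observations in rows, the augmentation is one line.
X      = [2 -1; 0 3; 1 1];
X_star = [X, X.^2];
```
<br />
Any linear method applied to <code>X_star</code>, LDA included, therefore fits a function that is quadratic in the original features.<br />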
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
 >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
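<br />
The same arithmetic can be reproduced without the data set. In the sketch below the prediction vector is a made-up stand-in with 31 disagreements, named <code>predicted</code> here rather than <code>class</code>, so it only illustrates how the count and the error rate are related.<br />
<br />
```matlab
% Hypothetical stand-ins for the vectors in the session above:
% 400 true labels and predictions that disagree with them 31 times.
group     = [ones(200, 1); 2 * ones(200, 1)];
predicted = group;
predicted(1:31) = 2;                       % 31 misclassified points (made up)

n_correct  = sum(predicted == group);      % number of correct labels
error_rate = mean(predicted ~= group);     % empirical error rate

fprintf('%d correct, empirical error rate %.4f\n', n_correct, error_rate);
```
<br />
Note that this is the error on the training data itself, so it tends to be an optimistic estimate of the error on new data.<br />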
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA correctly classifies only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The details of this function can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
 % [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in<br />
 % SCORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
 % and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
    latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
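Then we can see that y=score and v=U (possibly up to the sign of individual columns, since singular vectors are only determined up to sign).<br />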
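<br />
The core steps of <code>princomp</code> can also be reproduced with base Matlab functions only. The sketch below uses a small made-up data matrix instead of the 2_3 data and checks that the squared singular values agree with the eigenvalues of the sample covariance matrix.<br />
<br />
```matlab
% Small synthetic data set (hypothetical): rows are observations.
X = [2 0; 0 1; 4 3; 6 4; 3 2];
[m, n] = size(X);

centered  = bsxfun(@minus, X, mean(X, 1));       % subtract column means
[U, D, V] = svd(centered ./ sqrt(m - 1), 0);     % economy-size SVD
score  = centered * V;                           % data in the principal component basis
latent = diag(D).^2;                             % variance along each component

% The variances should match the eigenvalues of the sample covariance matrix.
ev = sort(eig(cov(X)), 'descend');
disp([latent, ev]);
```
<br />
Here the columns of <code>V</code> play the role of the principal component coefficients, matching the third point above.<br />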
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points within each class close to each other while simultaneously keeping them far from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
 >> library(MASS)  # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
: Calculate the singular value decomposition of the centered X (PCA requires subtracting the column means first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA)<br />
 >> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper, "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA] ", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without their computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, and the classes may have different sizes. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, i.e. <math> \max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> \min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math><br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is a scalar<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>max (\underline{w}^T S_{B} \underline{w})</math><br />
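<br />
As a quick numerical check of the identity above, the following Python sketch compares the squared distance between the projected means with the quadratic form <math>\underline{w}^T S_{B} \underline{w}</math>. The means are the ones used in the examples of this section; the direction <code>w</code> and the helper names are arbitrary choices made purely for illustration.<br />

```python
# Illustrative values only: class means taken from the examples in this
# section, and an arbitrary (not optimal) direction w chosen as an assumption.
mu1 = [1.0, 1.0]
mu2 = [5.0, 3.0]
w = [0.6, 0.8]

def dot(a, b):
    """Inner product of two vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))

def outer(a, b):
    """Outer product a b^T, returned as a list of rows."""
    return [[x * y for y in b] for x in a]

def quad_form(v, M):
    """Quadratic form v^T M v."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))

diff = [a - b for a, b in zip(mu1, mu2)]   # mu1 - mu2
S_B = outer(diff, diff)                    # between class covariance

# Squared distance between the projected means, computed directly ...
direct = (dot(w, mu1) - dot(w, mu2)) ** 2
# ... and through the quadratic form w^T S_B w.
via_S_B = quad_form(w, S_B)

print(S_B)
print(direct, via_S_B)
```

The two printed numbers should agree up to rounding, whichever direction <code>w</code> is chosen.<br />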
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math><br />
<br />
Thus, the covariance of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we get<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
Our goal: <math>min(\underline{w}^T S_{W} \underline{w})</math><br />
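<br />
The claim that the variance of the projected points equals <math>\underline{w}^T \Sigma \underline{w}</math> can be checked on a small data set. The sketch below uses hypothetical toy points and an arbitrary direction <code>w</code>, and assumes the <math>1/n</math> normalisation for both the covariance matrices and the variances; none of these values come from the course data.<br />

```python
def dot(a, b):
    """Inner product of two vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))

def quad_form(v, M):
    """Quadratic form v^T M v."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))

def covariance(points):
    """Covariance matrix of a list of points, using 1/n normalisation."""
    n, d = len(points), len(points[0])
    m = [sum(p[k] for p in points) / n for k in range(d)]
    return [[sum((p[i] - m[i]) * (p[j] - m[j]) for p in points) / n
             for j in range(d)] for i in range(d)]

def variance(values):
    """Variance of a list of scalars, using 1/n normalisation."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical toy data: four points per class, and an arbitrary direction w.
class1 = [[0.0, 0.0], [2.0, 1.0], [1.0, 2.0], [1.0, 1.0]]
class2 = [[4.0, 3.0], [6.0, 3.0], [5.0, 5.0], [5.0, 1.0]]
w = [0.6, 0.8]

sigma1, sigma2 = covariance(class1), covariance(class2)
S_W = [[sigma1[i][j] + sigma2[i][j] for j in range(2)] for i in range(2)]

# Sum of the variances of the projected points z = w^T x in each class ...
projected = (variance([dot(w, x) for x in class1])
             + variance([dot(w, x) for x in class2]))
# ... and the same quantity through the quadratic form w^T S_W w.
via_S_W = quad_form(w, S_W)

print(projected, via_S_W)
```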
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
Because this ratio is unchanged when <math>\underline{w}</math> is rescaled, the maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to the constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>.<br />
<br />
The constraint is needed because neither goal is well posed on its own: without it, <math>\ \underline w^T S_B \underline w</math> has no upper bound (it grows without limit as <math>\underline w</math> is scaled up), while <math>\ \underline w^T S_W \underline w</math> is trivially minimized by shrinking <math>\underline w</math> towards zero.<br />
<br />
We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math> where <math>\ \lambda </math> is the Lagrange multiplier<br />
<br />
<br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive definite matrices (assuming each class covariance is nonsingular), and so it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{W}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue <math>\ \lambda </math>: multiplying <math>S_{B}\ \underline{w} = \lambda\ S_{W}\ \underline{w}</math> on the left by <math>\underline{w}^T</math> and using the constraint gives <math>\underline{w}^T S_{B} \underline{w} = \lambda</math>, so the objective is largest when <math>\ \lambda </math> is.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
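<br />
A minimal Python sketch of this closed form, assuming two classes in two dimensions with a common covariance (so that <math>S_{W}</math> is twice that matrix) and the means used in the examples of this section. The helper functions are written out by hand for illustration only. The sketch checks that <math>\underline{w} = S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> satisfies the eigenvalue equation, and that no unit direction on a coarse grid gives a larger value of the objective.<br />

```python
import math

def dot(a, b):
    """Inner product of two vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))

def outer(a, b):
    """Outer product a b^T, returned as a list of rows."""
    return [[x * y for y in b] for x in a]

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def quad_form(v, M):
    """Quadratic form v^T M v."""
    return dot(v, matvec(M, v))

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Values assumed for illustration: common covariance [[1, 1.5], [1.5, 3]],
# so S_W = Sigma_1 + Sigma_2 is twice that matrix.
S_W = [[2.0, 3.0], [3.0, 6.0]]
mu1, mu2 = [1.0, 1.0], [5.0, 3.0]

diff = [a - b for a, b in zip(mu1, mu2)]
S_B = outer(diff, diff)

# Closed form: w is proportional to S_W^{-1} (mu1 - mu2).
w = matvec(inv2(S_W), diff)

# The matching eigenvalue is (mu1 - mu2)^T S_W^{-1} (mu1 - mu2).
lam = dot(diff, w)

# Check the eigenvalue equation S_W^{-1} S_B w = lambda w.
lhs = matvec(inv2(S_W), matvec(S_B, w))
rhs = [lam * x for x in w]

def J(v):
    """Objective (v^T S_B v) / (v^T S_W v)."""
    return quad_form(v, S_B) / quad_form(v, S_W)

# Largest objective value over a grid of unit directions, one per degree.
grid_best = max(J([math.cos(t), math.sin(t)])
                for t in (k * math.pi / 180 for k in range(180)))

print(w, lam)
print(lhs, rhs)
print(J(w), grid_best)
```

Only the direction of <code>w</code> matters here; any nonzero multiple gives the same value of the objective.<br />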
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
This time, we will use Matlab to compare PCA and FDA.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
From the graph, it can be observed that there is a huge overlap between the two classes when projecting with PCA, whereas there is hardly any overlap with FDA. FDA separates the two classes better than PCA in this example.<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
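Since <math>S_{B}</math> has rank one in the two-class case, only one eigenvalue of <math>S_{W}^{-1}S_{B}</math> is nonzero, so we use the first eigenvector to project the data onto a line.<br />
<br />
 >>[v d] = eigs( inv(sw) * sb );<br />
 >>w = v(:, 1);<br />
 >>X_hat = w'*X;<br />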
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
The data are mapped onto a line, and the two classes are separated perfectly here.<br />
<br />
==== An extension of Fisher's discriminant analysis for stochastic processes ====<br />
<br />
<br />
A general notion of Fisher's linear discriminant analysis can extend the classical multivariate concept to situations that allow for function-valued random elements. The development uses a bijective mapping that connects a second order process to the reproducing kernel Hilbert space generated by its within class covariance kernel. This approach provides a seamless transition between Fisher's original development and infinite dimensional settings that lends itself well to computation via smoothing and regularization. <br />
<br />
Link for Algorithm introduction:[[http://statgen.ncsu.edu/icsa2007/talks/HyejinShin.pdf]]<br />
<br />
=== Multi-class Problem ===<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With more than two classes, a single direction is generally not enough to separate them all, so it is more reasonable to have at least 2 directions.)<br />
<br />
The within class covariance matrix <math>\mathbf{S}_{W}</math> can be easily obtained:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> (written without the <math>\frac{1}{n_{i}}</math> factor so that it matches the decomposition of <math>\mathbf{S}_{T}</math> below) and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain directly. Here, we simplify matters by using the fact that the total covariance <math>\mathbf{S}_{T}</math> of the data is constant, that is, fixed for a given data set. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
There is another way to generate <math>\mathbf{S}_{B}</math>. <br />
<br />
Denote a total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
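<br />
The second equality holds because the cross terms vanish. A short sketch of why, using only the definition of <math>\mathbf{\mu}_{i}</math> as the mean of class <math>i</math>:<br />
:<math><br />
\begin{align}<br />
\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
= \left[\sum_{j: y_{j}=i}\mathbf{x}_{j} - n_{i}\mathbf{\mu}_{i}\right](\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
= \mathbf{0}<br />
\end{align}<br />
</math><br />
The transposed cross term vanishes for the same reason, so only the two sums shown above remain.<br />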
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
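<br />
The decomposition can also be verified numerically. The following Python sketch uses a hypothetical toy data set with three classes in two dimensions and the unnormalised sums that appear in the derivation above. The data and the helper names are assumptions made for illustration.<br />

```python
def mean(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def outer(a, b):
    """Outer product a b^T, returned as a list of rows."""
    return [[x * y for y in b] for x in a]

def add(A, B):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(c, A):
    """Matrix A multiplied by the scalar c."""
    return [[c * x for x in row] for row in A]

def scatter(points, center):
    """Sum of (x - center)(x - center)^T over the given points."""
    d = len(center)
    total = [[0.0] * d for _ in range(d)]
    for p in points:
        dev = [x - c for x, c in zip(p, center)]
        total = add(total, outer(dev, dev))
    return total

# Hypothetical toy data: class label -> list of points.
classes = {
    1: [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]],
    2: [[5.0, 5.0], [7.0, 5.0]],
    3: [[0.0, 8.0], [2.0, 8.0], [1.0, 6.0], [1.0, 10.0]],
}

all_points = [p for pts in classes.values() for p in pts]
mu = mean(all_points)            # total mean
d = len(mu)

S_W = [[0.0] * d for _ in range(d)]
S_B = [[0.0] * d for _ in range(d)]
for pts in classes.values():
    mu_i = mean(pts)
    S_W = add(S_W, scatter(pts, mu_i))                # within class part
    dev = [a - b for a, b in zip(mu_i, mu)]
    S_B = add(S_B, scale(len(pts), outer(dev, dev)))  # n_i (mu_i - mu)(mu_i - mu)^T

S_T = scatter(all_points, mu)    # total scatter about the total mean
S_sum = add(S_W, S_B)

print(S_T)
print(S_sum)
```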
<br />
Recall that in the two class FDA problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
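(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />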
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
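The two forms are closely related. Since <math>n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu}) + n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu}) = 0</math>, the general form for two classes reduces to <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n_{1}+n_{2}}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}</math>, a positive multiple of <math>\mathbf{S}_{B^{\ast}}</math>, so both lead to the same projection direction.<br />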
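<br />
The relationship between the two forms can be checked numerically: for two classes, the general <math>\mathbf{S}_{B}</math> equals <math>\frac{n_{1}n_{2}}{n_{1}+n_{2}}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}</math>. The Python sketch below assumes hypothetical class sizes and reuses the means from the earlier examples.<br />

```python
def outer(a, b):
    """Outer product a b^T, returned as a list of rows."""
    return [[x * y for y in b] for x in a]

# Means from the earlier examples; the class sizes are hypothetical.
mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
n1, n2 = 3, 5
n = n1 + n2

# Total mean as the size-weighted average of the class means.
mu = [(n1 * a + n2 * b) / n for a, b in zip(mu1, mu2)]

dev1 = [a - m for a, m in zip(mu1, mu)]
dev2 = [b - m for b, m in zip(mu2, mu)]

# General form: n1 (mu1 - mu)(mu1 - mu)^T + n2 (mu2 - mu)(mu2 - mu)^T.
S_B_general = [[n1 * dev1[i] * dev1[j] + n2 * dev2[i] * dev2[j]
                for j in range(2)] for i in range(2)]

# Two-class form (mu1 - mu2)(mu1 - mu2)^T, scaled by n1 n2 / n.
diff = [a - b for a, b in zip(mu1, mu2)]
factor = n1 * n2 / n
S_B_scaled = [[factor * x for x in row] for row in outer(diff, diff)]

print(S_B_general)
print(S_B_scaled)
```

Because the two matrices differ only by a positive factor, they give the same eigenvectors and hence the same projection.<br />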
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
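<br />
The chain of equalities above can be checked on a small example. The Python sketch below assumes three hypothetical class means in three dimensions, hypothetical class sizes, and an arbitrary <math>3 \times 2</math> matrix <code>W</code>. It compares the weighted sum of squared norms with <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math>.<br />

```python
def project(W, v):
    """Compute W^T v, where W is stored as a list of d rows of length m."""
    m = len(W[0])
    return [sum(W[r][c] * v[r] for r in range(len(W))) for c in range(m)]

# Hypothetical class means, class sizes and transformation matrix (d=3, k=3).
means = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0], [0.0, 4.0, 1.0]]
counts = [2, 3, 5]
W = [[1.0, 0.0], [0.5, 1.0], [0.0, -1.0]]

d = 3
n = sum(counts)
# Total mean as the size-weighted average of the class means.
mu = [sum(c * m[r] for c, m in zip(counts, means)) / n for r in range(d)]

# Left-hand side: sum_i n_i || W^T mu_i - W^T mu ||^2.
lhs = 0.0
for n_i, mu_i in zip(counts, means):
    z = project(W, [a - b for a, b in zip(mu_i, mu)])
    lhs += n_i * sum(x * x for x in z)

# Between class matrix S_B = sum_i n_i (mu_i - mu)(mu_i - mu)^T.
S_B = [[sum(n_i * (m[i] - mu[i]) * (m[j] - mu[j])
            for n_i, m in zip(counts, means))
        for j in range(d)] for i in range(d)]

# Right-hand side: Tr[W^T S_B W], the sum of w_c^T S_B w_c over the columns of W.
rhs = sum(W[i][col] * S_B[i][j] * W[j][col]
          for col in range(len(W[0])) for i in range(d) for j in range(d))

print(lhs, rhs)
```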
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class FDA problem, we have:<br />
<br />
<math> max(Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]) </math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices. Thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero diagonal entries, because <math>\operatorname{rank}(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
===Generalization of Fisher's Linear Discriminant ===<br />
<br />
Fisher's linear discriminant (Fisher, 1936) is a very popular technique among users of discriminant analysis. Among the reasons for this are its simplicity and the absence of strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be strongly affected by even a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient or appropriate way of handling the situation. There is therefore a need for robust procedures that can accommodate outliers. To meet this need, a generalization of Fisher's linear discriminant algorithm [[http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf]] has been developed that leads easily to a very robust procedure.<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for dimensionality reduction and feature extraction. It has applications in data visualization, data mining, reducing the dimensionality of a data set, and so on. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
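<br />
To make the approximation concrete, here is a minimal Python sketch on a hypothetical three-pixel "image". It assumes that <math>v_1</math> and <math>v_2</math> are orthonormal, in which case the coefficients can be taken as <math>\lambda_k = v_k^T(x - \bar{x})</math>. All numbers are made up for illustration and are not taken from the digit data.<br />

```python
import math

def dot(a, b):
    """Inner product of two vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical mean image and two orthonormal directions (columns of V).
x_bar = [1.0, 2.0, 3.0]
s = 1.0 / math.sqrt(2.0)
v1 = [s, s, 0.0]
v2 = [0.0, 0.0, 1.0]

# A hypothetical image to approximate.
x = [3.0, 2.0, 5.0]

# Coefficients (one column of W): projections of the centered image.
centered = [a - b for a, b in zip(x, x_bar)]
lam1, lam2 = dot(v1, centered), dot(v2, centered)

# Approximation x_hat = x_bar + lam1 * v1 + lam2 * v2.
x_hat = [m + lam1 * a + lam2 * b for m, a, b in zip(x_bar, v1, v2)]

# What the two coefficients cannot capture is orthogonal to v1 and v2.
residual = [a - b for a, b in zip(x, x_hat)]

print(lam1, lam2)
print(x_hat)
print(residual)
```

The image is summarised by the two numbers <code>lam1</code> and <code>lam2</code>, and the part that is lost is exactly the component of the image outside the span of <code>v1</code> and <code>v2</code>.<br />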
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any <math>c > 0</math>, so without loss of generality we may take <math>\textbf{w}</math> to have unit length,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of any vector <math>\displaystyle u </math>, formed by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq | \textbf{w}^T s \textbf{w} | \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
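
The example above can be checked numerically. The following is a minimal Python sketch, assuming the Lagrangian <math>\displaystyle L = x - y - \lambda(x^2+y^2-1)</math> as written above; the function names are illustrative and not part of the lecture.

```python
import math

def f(x, y):
    """Objective from the example: f(x, y) = x - y."""
    return x - y

def stationary_points():
    """Closed-form solutions of the stationarity conditions.

    From dL/dx = 0 and dL/dy = 0 we get x = 1/(2*lam) and y = -1/(2*lam);
    substituting into x^2 + y^2 = 1 gives lam = +sqrt(2)/2 or -sqrt(2)/2.
    """
    points = []
    for lam in (math.sqrt(2) / 2, -math.sqrt(2) / 2):
        x = 1 / (2 * lam)
        y = -1 / (2 * lam)
        points.append((x, y, lam))
    return points

points = stationary_points()
# The constrained maximum is the stationary point with the largest f.
best = max(points, key=lambda p: f(p[0], p[1]))
print(best[:2], f(best[0], best[1]))
```

Both stationary points lie on the unit circle; comparing the value of <math>\displaystyle f</math> at each is what separates the maximum from the minimum.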
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is nonzero. Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
This is an eigenvalue equation: <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
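
As a rough illustration of the result above, the following Python sketch finds the first principal component of a small data set. It uses power iteration, which is a substitute technique chosen here to keep the example dependency-free (the lecture itself uses an eigendecomposition or SVD); the data values and function names are hypothetical.

```python
import math

def sample_covariance(data):
    """Sample covariance matrix (divisor n - 1) of a list of equal-length rows."""
    n = len(data)
    d = len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(S, w):
    """Matrix-vector product S w."""
    return [sum(S[i][j] * w[j] for j in range(len(w))) for i in range(len(S))]

def first_principal_component(S, iterations=500):
    """Power iteration: repeatedly apply S and renormalise to unit length.

    Assumes the starting vector is not orthogonal to the top eigenvector.
    Returns the unit vector w and the variance w^T S w along it.
    """
    d = len(S)
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        Sw = mat_vec(S, w)
        norm = math.sqrt(sum(v * v for v in Sw))
        w = [v / norm for v in Sw]
    variance = sum(wi * swi for wi, swi in zip(w, mat_vec(S, w)))
    return w, variance

# Hypothetical two-dimensional data with strongly correlated coordinates.
data = [[2.0, 1.9], [0.0, 0.1], [-1.0, -1.2], [1.0, 0.8], [-2.0, -1.6]]
S = sample_covariance(data)
w, lam = first_principal_component(S)
print(w, lam)
```

The returned vector satisfies <math>\displaystyle S\textbf{w} \approx \lambda \textbf{w}</math> with unit length, and the variance along it is at least as large as the variance along either coordinate axis.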
<br />
<br />
D-dimensional data will have D eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
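
The variance decomposition can be verified on a small case. This is a sketch for a symmetric 2 by 2 matrix only, using the closed-form eigenvalues; the matrix entries are hypothetical values standing in for a sample covariance matrix.

```python
import math

def eigenvalues_2x2(S):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]], largest first."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    centre = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return centre + radius, centre - radius

# Hypothetical sample covariance matrix.
S = [[2.5, 2.25], [2.25, 2.065]]
lam1, lam2 = eigenvalues_2x2(S)

trace = S[0][0] + S[1][1]     # Var(x_1) + Var(x_2)
explained = lam1 / trace      # share of total variance captured by the first PC
print(lam1, lam2, trace, explained)
```

The ratio of the largest eigenvalue to the trace is one common way to judge how much of the data's variability a single component retains.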
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                 % load the noisy data set; the code below expects a matrix X<br />
 who                                        % list the variables now in the workspace<br />
 size(X)                                    % dimensions of X (one image per column)<br />
 imagesc(reshape(X(:,1),20,28)')            % display the first noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);                          % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';  % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)')      % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. leave the data unlabeled and compare the resulting groups).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6436stat841f102010-10-02T22:36:26Z<p>Hclam: /* Objective Function */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification - 2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature vector of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, is ''d''-dimensional and the label of the <math>\,i</math>th training input, <math>\,Y_{i} \in \mathcal{Y} </math>, takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input with feature vector <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
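
The empirical error rate is straightforward to compute. The following is a minimal Python sketch; the threshold rule <code>h</code> and the toy training data are hypothetical and serve only to illustrate the definition.

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(X) != len(Y) or not X:
        raise ValueError("X and Y must be non-empty and of equal length")
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)

def h(x):
    """Hypothetical classification rule: label 1 when the feature exceeds 0.5."""
    return 1 if x > 0.5 else 0

# Hypothetical one-dimensional training set.
X = [0.1, 0.4, 0.6, 0.9, 0.7]
Y = [0, 1, 1, 1, 0]

rate = empirical_error_rate(h, X, Y)
print(rate)
```

Here the rule misclassifies the second and fifth training inputs, so two of the five indicator terms equal 1.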
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the components of the Bayes classifier's model, such as the likelihoods and prior probabilities, must either be estimated from training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate considerably from their true population values, which in turn can cause the estimated posterior probabilities to deviate considerably from their true values. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
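
Approach 3 can be sketched for discrete feature vectors as follows. This is an illustrative Python sketch, assuming the class-conditional probabilities are estimated by simple relative frequencies of each exact feature vector; the function names, the fallback for unseen inputs, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def fit_density_classifier(X, Y):
    """Estimate P(X=x|Y=0), P(X=x|Y=1) and P(Y=1) by relative frequencies.

    Returns (r_hat, h), where r_hat(x) estimates P(Y=1|X=x) and h thresholds
    it at 1/2.
    """
    if not Y:
        raise ValueError("training set must be non-empty")
    prior1 = sum(Y) / len(Y)                     # (1/n) * sum of Y_i
    counts = {0: Counter(), 1: Counter()}
    for x, y in zip(X, Y):
        counts[y][tuple(x)] += 1
    sizes = {c: sum(counts[c].values()) for c in (0, 1)}
    if sizes[0] == 0 or sizes[1] == 0:
        raise ValueError("need at least one training input from each class")

    def r_hat(x):
        key = tuple(x)
        lik1 = counts[1][key] / sizes[1]         # estimate of P(X=x|Y=1)
        lik0 = counts[0][key] / sizes[0]         # estimate of P(X=x|Y=0)
        evidence = lik1 * prior1 + lik0 * (1 - prior1)
        if evidence == 0:
            return prior1                        # unseen x: fall back to the prior (an assumption)
        return lik1 * prior1 / evidence

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Hypothetical training set with two binary features.
X = [(1, 1), (1, 1), (1, 0), (0, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
Y = [1, 1, 1, 1, 0, 0, 0, 0]
r_hat, h = fit_density_classifier(X, Y)
print(r_hat((1, 1)), h((1, 1)), r_hat((0, 0)), h((0, 0)))
```

With continuous features, the frequency counts would be replaced by a density estimate (for example a fitted Gaussian per class), but the thresholding step stays the same.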
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \underset{i}{\operatorname{arg\,max}} \, P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
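
The multi-class rule can be sketched in a few lines of Python. The likelihood and prior values below are hypothetical; in practice they would be estimated from training data.

```python
def posteriors(likelihoods, priors):
    """Bayes formula: P(Y=i|X=x) = f_i(x) * pi_i / sum_j f_j(x) * pi_j."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("x has zero probability under every class")
    return {c: joint[c] / evidence for c in joint}

def bayes_rule(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)

# Hypothetical values of f_i(x) and pi_i for k = 3 classes at one input x.
likelihoods = {1: 0.10, 2: 0.30, 3: 0.05}
priors = {1: 0.5, 2: 0.2, 3: 0.3}

post = posteriors(likelihoods, priors)
label = bayes_rule(likelihoods, priors)
print(post, label)
```

Because the evidence term is the same for every class, the same label is obtained by maximizing <math>\,f_i(x)\pi_i</math> directly; the normalization only matters when the posterior probabilities themselves are needed.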
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes, <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
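<br />
The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the two likelihood values are the ones that appear in the calculation above, and the names <code>likelihood</code>, <code>prior</code> and <code>posterior</code> are illustrative choices, not part of the course material.<br />

```python
# Posterior for the worked example via Bayes' formula.
# The likelihood values are the ones used in the calculation above;
# all variable and function names are illustrative assumptions.
likelihood = {0: 0.2, 1: 0.05}   # P(X=(0,1,0) | Y=y)
prior = {0: 0.5, 1: 0.5}         # P(Y=y)


def posterior(likelihood, prior):
    """Return P(Y=y | X=x) for every class y."""
    evidence = sum(likelihood[y] * prior[y] for y in prior)
    return {y: likelihood[y] * prior[y] / evidence for y in prior}


post = posterior(likelihood, prior)
# Bayes classification rule: pick the class with the largest posterior.
predicted = max(post, key=post.get)
print(post, predicted)
```

The predicted class is the one with the larger posterior, which is the rule <math>\,h^*(x)</math> stated in the theorem.<br />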
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how plausible the occurrence of E is before anything is known about other events that could affect it. Theoretically, this prior probability is a ''belief'' that represents the baseline probability of E's occurrence. In practice, however, the prior probability of E is unknown, and therefore it must either be guessed or be estimated using a sample of available data. Once a guessed or estimated prior is available, the Bayesian view holds that the probability, that is, the believability, of E's occurrence can always be made more accurate should new information about events relevant to E become available. The Bayesian view also holds that the more useful information about relevant events is available, the more accurate the estimate of the probability of E's occurrence becomes. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work laying the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot carry out trials on an event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
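<br />
The relative-frequency approximation <math>n_x / n_t</math> can be illustrated with a small simulation. The sketch below assumes an event with a made-up intrinsic probability of 0.3 and uses Python's standard pseudo-random generator; the function name <code>relative_frequency</code> is hypothetical.<br />

```python
import random


def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials independent trials of an event with probability p
    and return the observed relative frequency n_x / n_t."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p)
    return n_x / n_trials


# The estimate tends to settle near p as the number of trials grows.
for n_t in (10, 1000, 100000):
    print(n_t, relative_frequency(0.3, n_t))
```

With only a handful of trials the estimate can be far from 0.3, which is the practical limitation mentioned above.<br />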
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the decision boundary the input lies. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the dimension of <math>\,x</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
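<br />
The coefficients ''a'' and ''b'' of the boundary can be computed directly from the class means, the shared covariance matrix and the priors. The following is a minimal two-dimensional sketch in plain Python; the helper names (<code>inv2</code>, <code>lda_boundary</code>, <code>boundary_value</code>) and the numerical values are made up for illustration.<br />

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the LDA boundary is a^T x + b = 0,
    with a = -2 Sigma^{-1} (mu1 - mu0) as in the derivation above."""
    s_inv = inv2(sigma)
    diff = [m1 - m0 for m1, m0 in zip(mu1, mu0)]
    a = [-2.0 * c for c in matvec(s_inv, diff)]
    b = (-2.0 * math.log(pi1 / pi0)
         + dot(mu1, matvec(s_inv, mu1))
         - dot(mu0, matvec(s_inv, mu0)))
    return a, b


def boundary_value(x, a, b):
    """Negative values fall on the class-1 side, positive on the class-0 side."""
    return dot(a, x) + b


# Illustrative (made-up) parameters with equal priors.
mu1, mu0 = [2.0, 1.0], [0.0, 0.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
a, b = lda_boundary(mu1, mu0, sigma, 0.5, 0.5)
midpoint = [(m1 + m0) / 2 for m1, m0 in zip(mu1, mu0)]
print(a, b, boundary_value(midpoint, a, b))
```

Because the expression was multiplied by -2 in the last step of the derivation, a negative value of <math>\,a^\top x+b</math> corresponds to the class-1 side of the boundary.<br />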
<br />
<br />
LDA under the two-class case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>, then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. For any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> have equal prior probabilities (for instance, when both classes contain the same number of data points, so that the estimated priors are equal), then the log term vanishes and the resulting decision boundary passes through the point exactly halfway between the class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, it nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that calculating the class posteriors <math>\Pr(Y=k|X=x)</math> gives optimal classification. He also shows that, by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \ \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the Gaussians share a common covariance matrix <math>\,\Sigma</math>, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
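<br />
The quadratic discriminant rule of the theorem can be sketched for two-dimensional data as follows. The class parameters below are invented for illustration, and the function names (<code>delta</code>, <code>classify_point</code>) are assumptions rather than any library's API.<br />

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def delta(x, mu, sigma, pi):
    """Quadratic discriminant:
    -1/2 log|Sigma| - 1/2 (x-mu)^T Sigma^{-1} (x-mu) + log(pi)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    s_inv = inv2(sigma)
    maha = sum(diff[i] * s_inv[i][j] * diff[j]
               for i in range(2) for j in range(2))
    return -0.5 * math.log(det2(sigma)) - 0.5 * maha + math.log(pi)


def classify_point(x, params):
    """params maps a class label to (mu, sigma, pi); return the arg max."""
    return max(params, key=lambda k: delta(x, *params[k]))


# Made-up parameters for two classes with different covariances.
params = {
    1: ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    2: ([3.0, 3.0], [[2.0, 0.3], [0.3, 0.5]], 0.5),
}
print(classify_point([0.1, -0.2], params), classify_point([3.0, 2.9], params))
```

Passing the same covariance matrix for every class gives the same decisions as the linear form, since the terms that do not depend on the class cancel in the comparison.<br />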
<br />
===In practice===<br />
The true parameters are unknown in practice, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we assume a common covariance matrix, it is estimated by the average of the class sample covariances weighted by class size, which is the maximum likelihood (ML) estimate:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}(n_r\hat{\Sigma_r})}{\sum_{l=1}^{k}(n_l)} </math><br />
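<br />
The sample estimates can be computed in a few lines. The sketch below uses plain Python lists and a tiny made-up data set; the function name <code>estimate_parameters</code> is hypothetical, and the covariances are divided by <math>n_k</math> as in the formulas above.<br />

```python
def estimate_parameters(xs, ys):
    """Return (priors, means, covariances, pooled covariance) estimated
    from feature vectors xs and class labels ys."""
    n, d = len(xs), len(xs[0])
    classes = sorted(set(ys))
    priors, means, covs = {}, {}, {}
    for k in classes:
        pts = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(pts)
        priors[k] = n_k / n
        mu = [sum(p[j] for p in pts) / n_k for j in range(d)]
        means[k] = mu
        # Sample covariance of class k, normalised by n_k.
        covs[k] = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in pts) / n_k
                    for j in range(d)] for i in range(d)]
    # Pooled covariance: sum_k n_k * Sigma_k / n = sum_k prior_k * Sigma_k.
    pooled = [[sum(priors[k] * covs[k][i][j] for k in classes)
               for j in range(d)] for i in range(d)]
    return priors, means, covs, pooled


# Tiny made-up data set: four points in class 1, two points in class 2.
xs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [5.0, 5.0], [7.0, 5.0]]
ys = [1, 1, 1, 1, 2, 2]
priors, means, covs, pooled = estimate_parameters(xs, ys)
print(priors, means, pooled)
```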
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class whose center is closest after this adjustment maximises <math>\,\delta_k</math>; with equal priors this is simply the nearest center. According to the theorem, we then assign the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
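<br />
A short sketch of Case 1 shows how the log prior can change the decision for a point that is slightly closer to one center. The centers, priors and function names below are made up for illustration.<br />

```python
import math


def delta_identity(x, mu, pi):
    """Discriminant for Sigma_k = I: -1/2 ||x - mu||^2 + log(pi)."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * sq_dist + math.log(pi)


def classify_identity(x, centers, priors):
    """Pick the class with the largest prior-adjusted discriminant."""
    return max(centers,
               key=lambda k: delta_identity(x, centers[k], priors[k]))


centers = {1: [0.0, 0.0], 2: [4.0, 0.0]}
x = [1.9, 0.0]   # slightly closer to the center of class 1

equal = classify_identity(x, centers, {1: 0.5, 2: 0.5})
skewed = classify_identity(x, centers, {1: 0.1, 2: 0.9})
print(equal, skewed)
```

With equal priors the point goes to the nearest center, while a strongly unbalanced prior moves it to the other class.<br />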
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, as a covariance matrix is, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
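<br />
The transformation can be checked numerically for a 2x2 covariance matrix. The sketch below uses the closed-form eigen-decomposition of a symmetric 2x2 matrix, assumes the matrix is positive definite, and uses made-up numbers; all function names are illustrative.<br />

```python
import math


def eig_sym2(sigma):
    """Eigenvalues and eigenvector matrix U (eigenvectors as columns)
    of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = sigma[0][0], sigma[0][1], sigma[1][1]
    if b == 0.0:
        return [a, c], [[1.0, 0.0], [0.0, 1.0]]
    center = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    eigenvalues = [center + radius, center - radius]
    columns = []
    for lam in eigenvalues:
        vx, vy = lam - c, b            # solves (sigma - lam*I) v = 0
        norm = math.hypot(vx, vy)
        columns.append((vx / norm, vy / norm))
    u = [[columns[0][0], columns[1][0]],
         [columns[0][1], columns[1][1]]]
    return eigenvalues, u


def whiten(x, eigenvalues, u):
    """x* = S^{-1/2} U^T x (requires strictly positive eigenvalues)."""
    return [sum(u[i][j] * x[i] for i in range(2)) / math.sqrt(eigenvalues[j])
            for j in range(2)]


def squared_distance(p, q):
    """Squared Euclidean distance."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def mahalanobis_squared(x, mu, sigma):
    """(x-mu)^T Sigma^{-1} (x-mu) using the explicit 2x2 inverse."""
    a, b, c = sigma[0][0], sigma[0][1], sigma[1][1]
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / (a * c - b * b)


sigma = [[2.0, 0.5], [0.5, 1.0]]
x, mu = [1.0, 2.0], [0.0, 1.0]
vals, u = eig_sym2(sigma)
lhs = squared_distance(whiten(x, vals, u), whiten(mu, vals, u))
rhs = mahalanobis_squared(x, mu, sigma)
print(lhs, rhs)
```

The two printed numbers should agree up to rounding, which is the identity derived above.<br />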
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes and you want to transform them to the same shape. Given a data point, you must decide which class it belongs to, but which transformation should you apply to it? For example, if you use the transformation of class A, then you have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because it does not make LDA's assumption that the covariance matrix of every class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not be the case in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare the differences between one given class and the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
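<br />
The two counts can be compared for a few dimensions with a small helper. The function names are illustrative; the formulas are the ones stated above.<br />

```python
def lda_param_count(K, d):
    """(K-1) boundaries, each a^T x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)


def qda_param_count(K, d):
    """(K-1) boundaries, each x^T a x + b^T x + c with
    d(d+1)/2 + d + 1 = d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)


# Two classes, increasing dimension.
for d in (2, 10, 64):
    print(d, lda_param_count(2, d), qda_param_count(2, d))
```

The integer division is exact because <math>d(d+3)</math> is always even.<br />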
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are non-linear functions, such as squares, of our original features. We then do LDA on our new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space then corresponds to a quadratic boundary in the original space.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the number of free entries of a symmetric <math>\,d \times d</math> matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
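where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method; here <math>v</math> is taken to be a diagonal matrix with diagonal entries <math>v_1,\dots,v_d</math>, so that <math>x^Tvx = \sum_{i} v_i x_i^2</math>.<br />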
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
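<br />
The equivalence between the linear function of the augmented vector and the quadratic function of the original vector can be verified numerically. In this sketch the quadratic term is assumed to have no cross terms (one coefficient <math>v_i</math> per squared feature), and the weights and function names are made up.<br />

```python
def augment(x):
    """Build x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]


def linear_score(w_star, x_star):
    """Linear function w*^T x* in the augmented space."""
    return sum(wi * xi for wi, xi in zip(w_star, x_star))


def quadratic_score(w, v, x):
    """sum_i v_i x_i^2 + w^T x in the original space (no cross terms)."""
    return (sum(vi * xi * xi for vi, xi in zip(v, x))
            + sum(wi * xi for wi, xi in zip(w, x)))


w = [1.0, -2.0]        # linear weights (illustrative)
v = [0.5, 3.0]         # weights on the squared features (illustrative)
x = [2.0, -1.0]

w_star = w + v
print(linear_score(w_star, augment(x)), quadratic_score(w, v, x))
```

Other feature maps, such as cubes, could be appended in the same way by changing <code>augment</code>.<br />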
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
 >> for i=1:400<br />
        for j=1:2<br />
            X_star(i,j+2) = X_star(i,j)^2;<br />
        end<br />
    end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
 >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
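<br />
The empirical error rate is simply the fraction of points whose predicted label differs from the true label. The sketch below reproduces the figure quoted above from counts alone; the label vectors are synthetic stand-ins, not the output of <code>classify</code>.<br />

```python
def empirical_error_rate(predicted, actual):
    """Fraction of points whose predicted label differs from the true one."""
    if len(predicted) != len(actual):
        raise ValueError("label vectors must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Synthetic labels: 400 points, 31 of them misclassified (369 correct).
actual = [1] * 200 + [2] * 200
predicted = list(actual)
for i in range(31):
    predicted[i] = 2          # flip the first 31 labels of class 1

rate = empirical_error_rate(predicted, actual)
print(rate)
```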
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA (371 versus 369); we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
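<br />
As a rough sketch of how little changes between algorithms, the loop below runs <code>classify</code> with each of its discriminant types on simulated data. The data, the seed, and the variable names (<code>training</code>, <code>types</code>, <code>err</code>) are assumptions made for illustration; the sketch also assumes the Statistics Toolbox is available for <code>mvnrnd</code> and <code>classify</code>.<br />
<br />
```matlab
rng(1);                                   % fix the seed so the run is repeatable
% Two simulated Gaussian classes of 50 points each (illustrative data only).
training = [mvnrnd([1 1], eye(2), 50); mvnrnd([4 4], eye(2), 50)];
group    = [ones(50,1); 2*ones(50,1)];

% Discriminant types accepted by classify.
types = {'linear', 'diaglinear', 'quadratic', 'diagquadratic', 'mahalanobis'};
err   = zeros(1, numel(types));
for t = 1:numel(types)
    % Classify the training points themselves, as in the example above.
    class  = classify(training, training, group, types{t});
    err(t) = mean(class ~= group);        % empirical error rate
    fprintf('%-14s empirical error rate: %.3f\n', types{t}, err(t));
end
```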
<br />
'''Recall: An analysis of the function <code>princomp</code> in Matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The details of this function can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from our SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in SCORES,<br />
 % the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing it with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, <code>princomp</code> centers <math>\,X</math> by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (possibly up to the sign of each column, since singular vectors are only determined up to sign).<br />
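<br />
The link between the SVD route and the covariance eigenvalues can also be checked without <code>princomp</code> itself. The sketch below uses simulated data (the matrix, seed, and variable names are assumptions for illustration) and compares the squared singular values of the centered, scaled data with the eigenvalues of the sample covariance, which is what <code>princomp</code> returns in <code>latent</code>.<br />
<br />
```matlab
rng(0);                                         % repeatable simulated data
X  = randn(50, 3) * [2 0 0; 0.5 1 0; 0 0 0.2];  % 50 observations (rows), 3 variables (columns)
m  = size(X, 1);
Xc = X - repmat(mean(X), m, 1);                 % subtract column means

[U, S, V] = svd(Xc / sqrt(m - 1), 0);           % economy-size SVD, as in princomp
latent    = diag(S).^2;                         % squared singular values

[E, D]    = eig(cov(X));                        % eigen-decomposition of the sample covariance
[ev, idx] = sort(diag(D), 'descend');           % order eigenvalues like the singular values
E         = E(:, idx);

gap_values  = max(abs(latent - ev))             % expected to be near zero
gap_vectors = max(max(abs(abs(V) - abs(E))))    % directions agree up to sign
```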
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points in each class close to each other while keeping them far from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
 >> library(MASS)   # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
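: Calculate the singular value decomposition of the column-centered X (PCA requires centering; without it the first direction is dominated by the overall mean). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA)<br />
 >> s <- svd(scale(X,scale=FALSE),nu=1,nv=1)<br />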
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This<br />
approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper, "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA] ", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without the computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, with each class possibly having a different size. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, i.e. <math> max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math><br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is scalar<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>max (\underline{w}^T S_{B} \underline{w})</math><br />
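<br />
As a small worked example, take the class means <math>\underline{\mu_{1}} = (1,1)^T</math> and <math>\underline{\mu_{2}} = (5,3)^T</math> used in the simulated R and Matlab examples of this section (an illustrative choice, not part of the derivation). Then<br />
:<math>S_{B} = \left( \begin{array}{c} -4 \\ -2 \end{array} \right) \left( \begin{array}{cc} -4 & -2 \end{array} \right) = \left( \begin{array}{cc} 16 & 8 \\ 8 & 4 \end{array} \right)</math><br />
which has rank one, since it is the outer product of a single vector with itself.<br />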
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math><br />
<br />
Thus, the covariance of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we get<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
Our goal: <math>min(\underline{w}^T S_{W} \underline{w})</math><br />
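<br />
As a small worked example, suppose both classes share the covariance <math>\Sigma_{1} = \Sigma_{2} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math> used in the simulated R and Matlab examples of this section (an illustrative choice, not part of the derivation). Then<br />
:<math>S_{W} = \Sigma_{1} + \Sigma_{2} = \left( \begin{array}{cc} 2 & 3 \\ 3 & 6 \end{array} \right)</math><br />
so that, for instance, <math>\underline{w} = (1,0)^T</math> gives <math>\underline{w}^T S_{W} \underline{w} = 2</math> while <math>\underline{w} = (0,1)^T</math> gives <math>\underline{w}^T S_{W} \underline{w} = 6</math>: the projected within class spread depends on the direction chosen.<br />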
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
Because the ratio does not change when <math>\underline{w}</math> is rescaled, this maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to the constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>.<br />
<br />
The constraint is needed because, taken separately, the two goals are not well posed: <math>\ \underline w^T S_B \underline w</math> has no upper bound (it can be made arbitrarily large by scaling <math>\underline w</math> up), and <math>\ \underline w^T S_W \underline w</math> can be made arbitrarily small by scaling <math>\underline w</math> down.<br />
<br />
We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math> where <math>\ \lambda </math> is the Lagrange multiplier<br />
<br />
<br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
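Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices, which are positive semi-definite; assuming at least one of them is positive definite, <math>\,S_{W}</math> is positive definite and so it has an inverse.<br />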
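<br />
The first line of the derivation above relies on the standard gradient of a quadratic form, stated here for a generic square matrix <math>\,A</math> (the symbol <math>\,A</math> is introduced only for this remark):<br />
:<math>\frac{\part}{\part \underline{w}} \left( \underline{w}^T A \underline{w} \right) = (A + A^T)\underline{w} = 2A\underline{w} \quad \textrm{when}\ A = A^T</math><br />
Both <math>\,S_{B}</math> and <math>\,S_{W}</math> are symmetric, which is where the factors of 2 come from.<br />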
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{W}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue <math>\ \lambda </math>. The largest eigenvalue is chosen because, under the constraint, the objective at such a solution is <math>\underline{w}^T S_{B} \underline{w} = \lambda\ \underline{w}^T S_{W} \underline{w} = \lambda</math>.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>, and no eigen-decomposition is needed in the two-class case.<br />
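<br />
A minimal numerical check of this simplification, assuming the same illustrative means and covariance as the simulated examples in this section (the variable names are hypothetical): the leading eigenvector of <math>S_{W}^{-1} S_{B}</math> and the closed-form direction <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> should agree up to sign and scale.<br />
<br />
```matlab
mu1   = [1; 1];
mu2   = [5; 3];
Sigma = [1 1.5; 1.5 3];              % common class covariance (illustrative)
Sw    = Sigma + Sigma;               % within class covariance
Sb    = (mu1 - mu2) * (mu1 - mu2)';  % between class covariance

% Route 1: leading eigenvector of inv(Sw)*Sb.
[V, D]        = eig(Sw \ Sb);
[lambda, idx] = max(diag(D));
w_eig         = V(:, idx) / norm(V(:, idx));

% Route 2: closed form, w proportional to inv(Sw)*(mu1 - mu2).
w_closed = Sw \ (mu1 - mu2);
w_closed = w_closed / norm(w_closed);

alignment = abs(w_eig' * w_closed)   % close to 1: same direction up to sign
lambda                               % close to 56/3 = (mu1-mu2)'*inv(Sw)*(mu1-mu2)
```
<br />
The backslash operator is used instead of <code>inv</code> because it solves the linear system directly; this is a stylistic choice of the sketch, not a requirement of the method.<br />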
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
This time, we will use Matlab to compare PCA and FDA.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the first principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
From the graph, it can be observed that there is a huge overlap between the two classes using PCA, whereas there is hardly any overlap using FDA. FDA separates the two classes better than PCA in this example.<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
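We use the first eigenvector to project the data onto a line. In the two-class case <math>\,S_{B}</math> has rank one, so only one eigenvalue is non-zero and only the first eigenvector carries discriminant information.<br />
<br />
 >>[v d] = eigs( inv(sw) * sb );<br />
 >>w = v(:, 1);<br />
 >>X_hat = w'*X;<br />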
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
The data are mapped onto a line, and the two classes are separated perfectly here.<br />
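<br />
Since the 2_3 data set may not be at hand, the following sketch repeats the same steps on simulated two-dimensional data and then classifies by thresholding the projected values. The data, the seed, and the midpoint threshold <code>c</code> are assumptions made for illustration; the midpoint rule is one simple choice and is not prescribed by the notes above.<br />
<br />
```matlab
rng(2);                                       % repeatable simulated data
X1 = mvnrnd([1 1], [1 1.5; 1.5 3], 200)';     % 2 x 200, class 1 (columns are points)
X2 = mvnrnd([5 3], [1 1.5; 1.5 3], 200)';     % 2 x 200, class 2

mu1 = mean(X1, 2);
mu2 = mean(X2, 2);
Sw  = cov(X1') + cov(X2');                    % within class covariance
w   = Sw \ (mu1 - mu2);                       % two-class FDA direction (closed form)

z1 = w' * X1;                                 % projected class 1 (1 x 200)
z2 = w' * X2;                                 % projected class 2 (1 x 200)

% With this w, class 1 projects above class 2, so threshold at the midpoint
% of the projected means (an assumed, simple decision rule).
c   = w' * (mu1 + mu2) / 2;
err = (sum(z1 < c) + sum(z2 >= c)) / 400      % empirical error rate
```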
<br />
==== An extension of Fisher's discriminant analysis for stochastic processes ====<br />
<br />
<br />
A general notion of Fisher's linear discriminant analysis can extend the classical multivariate concept to situations that allow for function-valued random elements. The development uses a bijective mapping that connects a second order process to the reproducing kernel Hilbert space generated by its within class covariance kernel. This approach provides a seamless transition between Fisher's original development and infinite dimensional settings that lends itself well to computation via smoothing and regularization. <br />
<br />
Link for Algorithm introduction:[[http://statgen.ncsu.edu/icsa2007/talks/HyejinShin.pdf]]<br />
<br />
=== Multi-class Problem ===<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(It is more reasonable to have at least 2 directions)<br />
<br />
The within class covariance matrix <math>\mathbf{S}_{W}</math> can be easily obtained:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter of class <math>i</math>, consistent with the decomposition of <math>\mathbf{S}_{T}</math> used below, and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not so easy to obtain directly. Here we use the fact that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed: it depends only on the data, not on the class labels. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
There is another way to generate <math>\mathbf{S}_{B}</math>. <br />
<br />
Denote a total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
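<br />
The cross terms drop out in the second equality because, within each class, the deviations from the class mean sum to zero:<br />
:<math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T} = \left[\sum_{j: y_{j}=i}\mathbf{x}_{j} - n_{i}\mathbf{\mu}_{i}\right](\mathbf{\mu}_{i}-\mathbf{\mu})^{T} = \mathbf{0}</math><br />
and the same holds for the transposed cross term.<br />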
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
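<br />
The decomposition can be checked numerically. The sketch below builds three simulated classes (the class sizes, centres, seed, and variable names are assumptions for illustration) and compares <math>\mathbf{S}_{T}</math> with <math>\mathbf{S}_{W} + \mathbf{S}_{B}</math>, all computed as unnormalized sums as in the derivation above.<br />
<br />
```matlab
rng(3);                                  % repeatable simulated data
k = 3; d = 2; n_i = [30 40 50];          % three classes of unequal size
X = []; y = [];
for i = 1:k
    % Class i is centred at 3*i*[1 -1] (illustrative choice).
    X = [X; randn(n_i(i), d) + repmat(3*i*[1 -1], n_i(i), 1)];
    y = [y; i*ones(n_i(i), 1)];
end

mu = mean(X)';                           % total mean (d x 1)
Xc = X - repmat(mu', size(X,1), 1);
St = Xc' * Xc;                           % total scatter: sum of (x-mu)(x-mu)'

Sw = zeros(d); Sb = zeros(d);
for i = 1:k
    Xi  = X(y == i, :);
    mui = mean(Xi)';                     % class mean (d x 1)
    Xci = Xi - repmat(mui', size(Xi,1), 1);
    Sw  = Sw + Xci' * Xci;                               % within class part
    Sb  = Sb + size(Xi,1) * (mui - mu) * (mui - mu)';    % between class part
end

gap = max(max(abs(St - (Sw + Sb))))      % expected to be near zero
```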
<br />
Recall that in the two class FDA problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
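(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />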
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
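Substituting <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math> into the general form gives <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}</math>, so the general <math>\mathbf{S}_{B}</math> and the two-class <math>\mathbf{S}_{B^{\ast}}</math> differ only by a positive scalar factor and lead to the same discriminant direction.<br />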
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class FDA problem, we have:<br />
<br />
<math> max(Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]) </math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
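The eigenproblem above can be checked numerically on a small case. The following is a minimal sketch, assuming a made-up set of three classes of 2-D points (the data and all helper names are illustrative, not from the lecture). It builds <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math>, solves the 2 by 2 eigenproblem for <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> in closed form, and collects the eigenvectors as the columns of <math>\mathbf{W}</math>.<br />

```python
import math

# Hypothetical toy data: three classes of 2-D points (not from the lecture).
classes = [
    [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)],
    [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0)],
    [(1.0, 8.0), (2.0, 9.0), (3.0, 10.0)],
]

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def outer(a, b):
    return [[a[0] * b[0], a[0] * b[1]], [a[1] * b[0], a[1] * b[1]]]

def add(A, B, scale=1.0):
    return [[A[i][j] + scale * B[i][j] for j in range(2)] for i in range(2)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matvec(A, v):
    return (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])

def inverse(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

mu = mean([p for pts in classes for p in pts])  # overall mean

# Between-class scatter S_B and within-class scatter S_W.
S_B = [[0.0, 0.0], [0.0, 0.0]]
S_W = [[0.0, 0.0], [0.0, 0.0]]
for pts in classes:
    mu_i = mean(pts)
    d = (mu_i[0] - mu[0], mu_i[1] - mu[1])
    S_B = add(S_B, outer(d, d), scale=len(pts))
    for p in pts:
        e = (p[0] - mu_i[0], p[1] - mu_i[1])
        S_W = add(S_W, outer(e, e))

# Eigenvalues of M = S_W^{-1} S_B from the characteristic polynomial.
M = matmul(inverse(S_W), S_B)
tr = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
eigenvalues = [(tr + disc) / 2.0, (tr - disc) / 2.0]  # largest first

def eigenvector(lam):
    # Solve (M - lam I) v = 0 using the first row, falling back to the second.
    v = (M[0][1], lam - M[0][0])
    if abs(v[0]) + abs(v[1]) < 1e-12:
        v = (lam - M[1][1], M[1][0])
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

W = [eigenvector(lam) for lam in eigenvalues]  # k - 1 = 2 projection directions
```

With 2-D data and three classes the two directions span the whole plane, so nothing is discarded here; the sketch is only meant to show that each column of <math>\mathbf{W}</math> satisfies <math>\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{S}_{W}\mathbf{w}_{i}</math>.<br />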
<br />
===Generalization of Fisher's Linear Discriminant ===<br />
<br />
{{Cleanup|date= 2 October 2010|reason= I am not sure how to interpret the last sentence in this paragraph}}<br />
<br />
Fisher's linear discriminant (Fisher, 1936) is a very popular technique among users of discriminant analysis, in part because it is simple and does not rely on strict assumptions.<br />
However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. The resulting discriminant rule can also be strongly affected by a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient or appropriate way of handling the situation. There is therefore a need for robust procedures that can accommodate outliers. A generalization of Fisher's linear discriminant algorithm [[http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf]] has been developed that leads to a very robust procedure.<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, and feature extraction for classification, and it is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in the data<br />
* Preserving the information stored in the data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), and so on.<br />
<br />
We can then preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We apply PCA to the data and compare the images of the labeled data. This is an example of classification.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>c > 0</math>, so without loss of generality we take <math>\textbf{w}</math> to have unit length:<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, so that<br />
<br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br />
is the sample variance of the projection <math>\displaystyle u </math> along the direction <math>\displaystyle \textbf{w} </math>. The first principal component is the direction that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded on the unit sphere, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we introduce a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is a maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, then there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, so that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which gives the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
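As a sanity check on this example, the sketch below evaluates the partial derivatives of the Lagrangian <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> at both candidate points and compares the result with a brute-force search over the unit circle. The function names and the grid size are illustrative choices, not part of the lecture.<br />

```python
import math

def f(x, y):
    return x - y

def lagrangian_gradient(x, y, lam):
    # Partial derivatives of L(x, y, lam) = x - y - lam * (x^2 + y^2 - 1).
    return (1.0 - 2.0 * lam * x,
            -1.0 - 2.0 * lam * y,
            -(x * x + y * y - 1.0))

s = math.sqrt(2.0) / 2.0
candidates = [(s, -s), (-s, s)]

stationary = []
for x, y in candidates:
    lam = 1.0 / (2.0 * x)          # solve dL/dx = 0 for lambda
    stationary.append((x, y, lam, lagrangian_gradient(x, y, lam)))

# The constrained maximum is the candidate with the larger value of f.
best_candidate = max(candidates, key=lambda p: f(*p))

# Brute-force check: evaluate f on a fine grid of points on the unit circle.
n = 100000
grid = ((math.cos(2.0 * math.pi * i / n), math.sin(2.0 * math.pi * i / n))
        for i in range(n))
best_on_grid = max(grid, key=lambda p: f(*p))
```

Both candidates make all three partial derivatives vanish, and the grid search agrees that the maximum sits at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />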
<br />
====Determining '''W''' ====<br />
Returning to the original problem, we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
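If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is nonzero. Setting the derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies it.<br />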
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. The first PC, '''u<sub>1</sub>''', therefore has the maximum variance<br /> (i.e. it captures as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components account for successively smaller parts of the total variability.<br />
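The result that the first principal component is the top eigenvector of the sample covariance matrix can be illustrated numerically. The sketch below is one possible way to do it, assuming a small made-up 2-D sample and using power iteration (a standard method for finding the dominant eigenvector, not something prescribed in the lecture); all names are illustrative.<br />

```python
import math

# Hypothetical 2-D sample, one point per tuple (not taken from the lecture).
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

def sample_covariance(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def matvec(A, v):
    return (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])

def quadratic_form(A, v):
    Av = matvec(A, v)
    return v[0] * Av[0] + v[1] * Av[1]

def first_principal_component(S, iterations=200):
    # Power iteration: repeatedly apply S and renormalize to unit length.
    w = (1.0, 0.0)
    for _ in range(iterations):
        v = matvec(S, w)
        norm = math.hypot(v[0], v[1])
        w = (v[0] / norm, v[1] / norm)
    return w, quadratic_form(S, w)  # (direction w*, variance lambda*)

S = sample_covariance(data)
w_star, lam_star = first_principal_component(S)

# Variance of the projection along a few other unit directions, for comparison.
other_variances = [
    quadratic_form(S, (math.cos(t), math.sin(t)))
    for t in (i * math.pi / 36.0 for i in range(36))
]
```

Here <code>lam_star</code> plays the role of <math>\, \lambda^*</math> and no other sampled direction achieves a larger projected variance.<br />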
<br />
<br />
''D''-dimensional data have ''D'' eigenvectors, with eigenvalues ordered as<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
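The decomposition of the total variance can be verified on a toy example. This is a sketch under assumed data: the six 2-D points and the helper names are made up, and the principal directions are obtained from the closed-form rotation angle of a symmetric 2 by 2 matrix rather than from a general eigen-solver.<br />

```python
import math

# Hypothetical 2-D sample (not from the lecture).
data = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.4), (4.0, 3.8), (5.0, 5.1), (6.0, 5.5)]

n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n
centered = [(p[0] - mx, p[1] - my) for p in data]

def sample_cov(a, b):
    # a and b are sequences that already have mean zero.
    return sum(x * y for x, y in zip(a, b)) / (len(a) - 1)

x1 = [c[0] for c in centered]
x2 = [c[1] for c in centered]
sxx, syy, sxy = sample_cov(x1, x1), sample_cov(x2, x2), sample_cov(x1, x2)

# Principal directions of the symmetric matrix [[sxx, sxy], [sxy, syy]].
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
w1 = (math.cos(theta), math.sin(theta))    # direction of greatest variance
w2 = (-math.sin(theta), math.cos(theta))   # orthogonal direction

# Project the centered data onto each principal direction.
u1 = [w1[0] * a + w1[1] * b for a, b in centered]
u2 = [w2[0] * a + w2[1] * b for a, b in centered]

var_u1, var_u2 = sample_cov(u1, u1), sample_cov(u2, u2)
total_variance = sxx + syy                 # Tr(S) = Var(x_1) + Var(x_2)
```

The two projected variances add up to <code>total_variance</code>, and the projections <code>u1</code> and <code>u2</code> are uncorrelated.<br />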
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, assuming that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
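 load noisy                                   % load the data matrix X (one image per column)<br />
 who                                          % list the variables that were loaded<br />
 size(X)                                      % number of pixels by number of images<br />
 imagesc(reshape(X(:,1),20,28)')              % display the first noisy image<br />
 colormap gray<br />
 imagesc(reshape(X(:,1),20,28)')              % redraw it with the gray colormap<br />
 [u s v] = svd(X);                            % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure                                       % open a second window for the reconstruction<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />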
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this. We can classify the data (i.e. train on labeled data and assign each new data point to a class) or cluster the data (i.e. group unlabeled data points by similarity).<br />
<br />
In principle we can do this with the entire data set (for an 8 by 8 picture, we can use all 64 pixels). However, working in the full dimension is hard, and it is easier to work with a reduced set of features extracted from the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
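The centering, encoding and reconstruction steps of the algorithm can be sketched on a tiny example. The code below is only an illustration under stated assumptions: the five 2-D points are made up, <code>U</code> holds a single direction (D = 2, d = 1), and that direction is computed in closed form instead of with a general eigen-solver.<br />

```python
import math

# Hypothetical data with D = 2 features and n = 5 samples (not from the lecture).
X = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
n = len(X)

# Step 1: subtract the mean so that the data have mean zero.
mean = (sum(p[0] for p in X) / n, sum(p[1] for p in X) / n)
Xc = [(p[0] - mean[0], p[1] - mean[1]) for p in X]

# Step 2: sample covariance and its top eigenvector U (here D x d = 2 x 1).
sxx = sum(a * a for a, _ in Xc) / (n - 1)
syy = sum(b * b for _, b in Xc) / (n - 1)
sxy = sum(a * b for a, b in Xc) / (n - 1)
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
U = (math.cos(theta), math.sin(theta))

# Step 3: encode, Y = U^T X (one coefficient per sample).
Y = [U[0] * a + U[1] * b for a, b in Xc]

# Step 4: reconstruct, X_hat = U Y, then add the mean back.
X_hat = [(mean[0] + U[0] * y, mean[1] + U[1] * y) for y in Y]

# The variance lost by the reconstruction equals the discarded eigenvalue.
lam1 = sum(y * y for y in Y) / (n - 1)
residual = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
               for p, q in zip(X, X_hat)) / (n - 1)
```

Because these points lie close to a line, <code>residual</code> is a small fraction of the total variance <code>sxx + syy</code>, which is the sense in which the low-variance direction behaves like noise.<br />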
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means projecting the data onto a lower dimensional subspace by taking inner products with the top d eigenvectors. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using the mapping <math>\, Y = U^T X </math>, where <math>\, U </math> is <math>D \times d</math>. <br />
::#When we reconstruct the training set, we use only the top d dimensions. This eliminates the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using the mapping <math>\, \hat{X} = UY </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6435stat841f102010-10-02T22:36:00Z<p>Hclam: /* Multi-class Problem */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginning of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.or/wiki/Regression_analysis clustering], and [http://en.wikipedia.org/wiki/Regression_analysis dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>-th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>-th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
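The empirical error rate is straightforward to compute once a rule <math>\,h</math> is fixed. The sketch below assumes a made-up one-feature rule (a threshold on fruit diameter) and a made-up training set; only the formula for <math>\,\hat{L}_{n}</math> comes from the definition above.<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training inputs X[i] whose predicted label h(X[i]) differs from Y[i]."""
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# Hypothetical classification rule: threshold on diameter (in cm).
def h(diameter):
    return "orange" if diameter >= 7.5 else "apple"

# Hypothetical training set: diameters and true labels.
X = [6.0, 7.0, 8.0, 9.0, 7.8, 6.5]
Y = ["apple", "apple", "orange", "orange", "apple", "apple"]

rate = empirical_error_rate(h, X, Y)
```

In this toy set only the fifth input (a large apple) is misclassified, so the indicator equals 1 for one input out of six.<br />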
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
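The posterior calculation can be written in a few lines. This is a minimal sketch, assuming the priors and the likelihoods of the observed <math>\,x</math> are already known; the three classes and all numbers are invented for illustration.<br />

```python
def posteriors(priors, likelihoods):
    """Apply the Bayes formula for one observed x.

    priors[y]      = P(Y = y)
    likelihoods[y] = P(X = x | Y = y)
    """
    evidence = sum(likelihoods[y] * priors[y] for y in priors)  # P(X = x)
    return {y: likelihoods[y] * priors[y] / evidence for y in priors}

# Hypothetical priors and likelihoods for a single observed input x.
priors = {"apple": 0.5, "orange": 0.3, "pear": 0.2}
likelihoods = {"apple": 0.10, "orange": 0.40, "pear": 0.25}

post = posteriors(priors, likelihoods)
label = max(post, key=post.get)  # most probable class
```

Note that the class with the largest prior (apple) is not the one chosen here, because the likelihood of the observed features is much higher under the orange class.<br />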
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
A rather detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that the components that make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate considerably from their true population values, and this can ultimately cause the estimated posterior probabilities to deviate considerably from their true values. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
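<br />
As a rough illustration of approach 3, the following Matlab sketch runs the steps on a small one-dimensional data set. The data values are made up for illustration, and fitting a Gaussian density to each class is only one possible choice of density estimate; neither comes from the lecture.<br />
<br />
 >> % Hypothetical 1-D training data: features X and labels Y (illustration only)<br />
 >> X = [1.0 1.5 2.0 2.5 4.0 4.5 5.0 5.5]';<br />
 >> Y = [0 0 0 0 1 1 1 1]';<br />
 >> pi1 = mean(Y);  pi0 = 1 - pi1;             % estimate of P(Y=1) as (1/n)*sum(Y_i)<br />
 >> mu0 = mean(X(Y==0));  s0 = std(X(Y==0));   % assumed Gaussian fit to class 0<br />
 >> mu1 = mean(X(Y==1));  s1 = std(X(Y==1));   % assumed Gaussian fit to class 1<br />
 >> gauss = @(x, mu, s) exp(-(x-mu).^2 ./ (2*s^2)) ./ (s*sqrt(2*pi));<br />
 >> x = 3.0;                                    % new data input<br />
 >> f0 = gauss(x, mu0, s0);  f1 = gauss(x, mu1, s1);<br />
 >> r_hat = f1*pi1 / (f1*pi1 + f0*pi0);         % estimated posterior P(Y=1|X=x)<br />
 >> h = double(r_hat > 0.5);                    % classification rule<br />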
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
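<br />
A minimal Matlab sketch of this rule for <math>\,k = 3</math> classes is given below. The likelihood and prior values are hypothetical and serve only to show the mechanics of the Bayes formula followed by the arg max.<br />
<br />
 >> % Hypothetical values at one input x (illustration only)<br />
 >> f = [0.10 0.40 0.20];                      % likelihoods f_i(x) = P(X=x|Y=i)<br />
 >> prior = [0.5 0.2 0.3];                     % priors pi_i = P(Y=i)<br />
 >> post = (f .* prior) / sum(f .* prior);     % Bayes formula for P(Y=i|X=x)<br />
 >> [p_max, h] = max(post);                    % h is the class with the largest posterior<br />
<br />
Since the denominator is the same for every class, comparing the products <code>f .* prior</code> directly would give the same class.<br />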
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
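<br />
The arithmetic can be checked with a few lines of Matlab. This is only a sketch: the variable names are illustrative, and the likelihood values 0.05 and 0.2 are the ones used in the calculation above.<br />
<br />
 >> lik1 = 0.05;                    % P(X=(0,1,0)|Y=1), as used above<br />
 >> lik0 = 0.2;                     % P(X=(0,1,0)|Y=0), as used above<br />
 >> prior1 = 0.5; prior0 = 0.5;     % prior probabilities<br />
 >> r_hat = lik1*prior1 / (lik1*prior1 + lik0*prior0)<br />
 r_hat =<br />
 0.2000<br />
 >> h = double(r_hat > 0.5)         % 1 = pass, 0 = fail<br />
 h =<br />
 0<br />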
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable the occurrence of event E is prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
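<br />
The constants ''a'' and ''b'' can be computed directly once the parameters are known. The Matlab sketch below uses hypothetical values of <math>\,\mu_1</math>, <math>\,\mu_0</math>, <math>\,\Sigma</math> and the priors (they are not taken from any course data). Because the derivation multiplied both sides by -2, a negative value of the boundary function corresponds to <math>\,P(Y=1|X=x) > P(Y=0|X=x)</math>.<br />
<br />
 >> % Hypothetical parameters for d = 2 (illustration only)<br />
 >> mu1 = [2; 2];  mu0 = [0; 0];<br />
 >> Sigma = [1 0.5; 0.5 2];                    % shared covariance matrix<br />
 >> pi1 = 0.5;  pi0 = 0.5;                     % priors<br />
 >> a = -2 * (Sigma \ (mu1 - mu0));            % a = -2*inv(Sigma)*(mu1 - mu0)<br />
 >> b = -2*log(pi1/pi0) + mu1'*(Sigma\mu1) - mu0'*(Sigma\mu0);<br />
 >> x = [1.5; 1.8];                            % new data input<br />
 >> g = a'*x + b;                              % g = 0 exactly on the decision boundary<br />
 >> h = double(g < 0);                         % g < 0 means class 1, otherwise class 0<br />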
<br />
Not surprisingly, the Bayes classifier's decision boundary being linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. To find the Bayes classifier's decision boundary between any two classes <math>\,m </math> and <math>\,n</math>, all we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. Doing so, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. In the special case where <math>\,m </math> and <math>\,n</math> both have the same number of data points, so that their estimated priors are equal, the resulting decision boundary lies exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.<br />
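<br />
A short check of this last claim, assuming that equal class sizes give equal estimated priors <math>\,\pi_m = \pi_n</math> (so that the log term vanishes) and using the symmetry of <math>\,\Sigma^{-1}</math>, is to rewrite the boundary as:<br />
<br />
:<math>\,\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) = -2(\mu_m-\mu_n)^\top\Sigma^{-1}\left(x-\frac{\mu_m+\mu_n}{2}\right)=0</math><br />
<br />
This equation is satisfied by <math>\,x = \frac{\mu_m+\mu_n}{2}</math>, so the boundary passes through the midpoint of the two class means.<br />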
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we obtain the optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier's rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
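<br />
One way to see this step, sketched here under the assumption <math>\,\Sigma_k=\Sigma</math> for all <math>\,k</math>, is to expand the quadratic form in the general <math>\,\delta_k</math>:<br />
<br />
:<math>\,-\frac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k) = -\frac{1}{2}x^\top\Sigma^{-1}x + x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k</math><br />
<br />
The terms <math>-\frac{1}{2}\log(|\Sigma|)</math> and <math>-\frac{1}{2}x^\top\Sigma^{-1}x</math> are the same for every class, so they do not affect the <math>\,\arg\max</math> and can be dropped, which leaves the linear expression.<br />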
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
Since the true parameters are unknown in practice, we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where the classes share a common covariance matrix (LDA), the maximum likelihood estimate of <math>\,\Sigma</math> is the average of the class sample covariances weighted by class size:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma_r}}{\sum_{l=1}^{k}n_l} </math><br />
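<br />
These estimates could be computed as in the following Matlab sketch. The small data set is hypothetical, and <code>cov(Xk, 1)</code> is used because its second argument requests normalisation by <math>\,n_k</math> rather than <math>\,n_k - 1</math>, which matches the formulas above.<br />
<br />
 >> % Hypothetical training set: rows of X are observations, y holds labels 1..K<br />
 >> X = [1 2; 2 1; 2 3; 6 5; 7 7; 8 6; 7 5];<br />
 >> y = [1 1 1 2 2 2 2]';<br />
 >> [n, d] = size(X);  K = max(y);<br />
 >> Sigma_pooled = zeros(d);<br />
 >> for k = 1:K<br />
     Xk = X(y==k, :);  nk = size(Xk, 1);<br />
     pi_hat(k) = nk / n;                       % prior estimate n_k / n<br />
     mu_hat(k, :) = mean(Xk, 1);               % class mean<br />
     Sigma_hat{k} = cov(Xk, 1);                % class covariance, normalised by n_k<br />
     Sigma_pooled = Sigma_pooled + nk * Sigma_hat{k};<br />
 end<br />
 >> Sigma_pooled = Sigma_pooled / n;           % common covariance estimate for LDA<br />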
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data of each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can compute the squared distance between a point and each class center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. When the priors are equal, the class whose center is closest to the point maximises <math>\,\delta_k</math>; otherwise, the log-prior term shifts the comparison in favour of the more probable classes. According to the theorem, we then assign the point to the class <math>\,k</math> with the largest <math>\,\delta_k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
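<br />
A minimal Matlab sketch of Case 1 is shown below, with hypothetical class centers and priors chosen only for illustration. Each <math>\,\delta_k</math> is minus half the squared Euclidean distance to the center plus the log prior.<br />
<br />
 >> % Hypothetical class centers (one per column) and priors, K = 3, d = 2<br />
 >> mu = [0 4 0; 0 0 4];<br />
 >> prior = [0.5 0.25 0.25];<br />
 >> x = [1; 2.5];                              % point to classify<br />
 >> for k = 1:3<br />
     delta(k) = -0.5 * sum((x - mu(:,k)).^2) + log(prior(k));<br />
 end<br />
 >> [d_max, h] = max(delta);                   % h is the predicted class<br />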
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
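<br />
The transformation can be checked numerically with the following Matlab sketch. The covariance matrix, mean and data point are hypothetical values; the check confirms that the squared Euclidean distance after the transformation agrees with <math>\,(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)</math> up to round-off.<br />
<br />
 >> Sigma = [4 1; 1 2];  mu = [1; 1];  x = [3; 0];   % hypothetical values<br />
 >> [U, S, V] = svd(Sigma);                    % Sigma = U*S*U' for this symmetric matrix<br />
 >> T = diag(1 ./ sqrt(diag(S))) * U';         % T = S^(-1/2) * U'<br />
 >> x_star = T * x;  mu_star = T * mu;         % transformed point and center<br />
 >> d_orig = (x - mu)' * (Sigma \ (x - mu));   % distance term from delta_k<br />
 >> d_star = sum((x_star - mu_star).^2);       % squared Euclidean distance after transforming<br />
 >> abs(d_orig - d_star) < 1e-10;              % true when the two agree up to round-off<br />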
<br />
Note that when we have multiple classes, they must all be given the same transformation; otherwise, we would have to assume ahead of time that a data point belongs to one class or another. All classes therefore need to have the same shape for classification to be applicable using this method, so this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Consider two classes with different shapes, each of which would need its own transformation to become spherical. Given a new data point, we must determine which class it belongs to, but which transformation should we apply to it? If we use the transformation of class A, for example, then we have already assumed that the data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold for real data. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare the differences between one given class and the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
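<br />
As a rough worked example of these formulas, assume <math>\,K=2</math> classes and take <math>\,d=64</math>, the dimension of the 2_3 data used in the Matlab examples, and <math>\,d=2</math>, its dimension after PCA:<br />
<br />
:<math>\,d=64:\quad \text{LDA: } (2-1)(64+1)=65, \qquad \text{QDA: } (2-1)\left(\frac{64\times 67}{2}+1\right)=2145</math><br />
:<math>\,d=2:\quad \text{LDA: } (2-1)(2+1)=3, \qquad \text{QDA: } (2-1)\left(\frac{2\times 5}{2}+1\right)=6</math><br />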
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the original entries squared, to a cubic dimension with the original entries cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
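<br />
:The same matrix can also be built without loops. The line below is a sketch of an equivalent vectorized form, assuming <code>sample</code> is the 400-by-2 matrix produced by the PCA step above.<br />
<br />
 >> X_star = [sample, sample.^2];   % columns: x1, x2, x1^2, x2^2<br />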
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
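<br />
:The same error rate can be computed directly from the two label vectors. A minimal sketch, assuming <code>class</code> and <code>group</code> are the 400-by-1 vectors from the steps above (the name <code>err_rate</code> is only illustrative):<br />
<br />
>> err_rate = mean(class ~= group)   % fraction of points whose predicted label differs from the true one<br />
err_rate =<br />
0.0775<br />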
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
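<br />
:The arguments passed to <code>sprintf</code> in the QDA plot come from expanding a quadratic form. As a sketch, assuming the boundary stored in <code>coeff(1,2)</code> has the form <math>0 = K + L^T \mathbf{x} + \mathbf{x}^T Q \mathbf{x}</math> with <math>\mathbf{x} = (x, y)^T</math>, <math>L = (l_1, l_2)^T</math> and <math>Q = (q_{ij})</math> (symbols introduced here only for illustration):<br />
<br />
:<math><br />
\begin{align}<br />
0 &= K + l_1 x + l_2 y + q_{11}x^2 + (q_{12}+q_{21})xy + q_{22}y^2<br />
\end{align}<br />
</math><br />
<br />
:This is why the coefficient of the cross term is <code>q(1,2)+q(2,1)</code>.<br />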
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] that performs PCA conveniently; its details can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
When comparing the code above with the SVD method, we should pay attention to the following points:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using the SVD method and <code>princomp</code> respectively, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
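<br />
:This can be checked numerically. A minimal sketch, assuming the variables <code>y</code>, <code>score</code>, <code>v</code> and <code>U</code> from the two code blocks above; absolute values are compared because a singular vector is only determined up to sign:<br />
<br />
>> max(max(abs(abs(y(:,1:2)) - abs(score(:,1:2)))))   % close to zero, up to round-off<br />
>> max(max(abs(abs(v(:,1:2)) - abs(U(:,1:2)))))       % close to zero, up to round-off<br />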
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well-defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points in individual classes close to each other while simultaneously keeping them far from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
>> library(MASS)  # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
: Calculate the singular value decomposition of the column-centred X (PCA is defined on centred data). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA)<br />
>> s <- svd(scale(X, center=TRUE, scale=FALSE),nu=1,nv=1)<br />
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, at a lower computational cost. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, and the classes may have different sizes. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, i.e. <math> \max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> \min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math>.<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\ \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is scalar<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>max (\underline{w}^T S_{B} \underline{w})</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math><br />
<br />
Thus, the covariance of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we get<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
Our goal: <math>min(\underline{w}^T S_{W} \underline{w})</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
Because the ratio does not change when <math>\underline{w}</math> is rescaled, we can fix the scale of <math>\underline{w}</math> by requiring <math>\underline{w}^T S_{W} \underline{w} = 1</math>. The maximization problem is then equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to the constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. <br />
<br />
The constraint is needed because, taken separately, <math>\ \underline w^T S_B \underline w</math> has no upper bound and <math>\ \underline w^T S_W \underline w</math> can be made arbitrarily close to zero simply by rescaling <math>\underline{w}</math>.<br />
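<br />
The objective depends only on the direction of <math>\underline{w}</math>, not on its length, which can be checked numerically. A minimal sketch in Matlab, assuming the class means <math>(1,1)^T</math> and <math>(5,3)^T</math> and the common covariance used in the R example above (the names <code>J</code> and <code>w</code> are illustrative):<br />
<br />
>> sw = 2*[1 1.5; 1.5 3];                     % S_W = Sigma_1 + Sigma_2<br />
>> sb = ([1;1]-[5;3])*([1;1]-[5;3])';         % S_B<br />
>> J = @(w) (w'*sb*w)/(w'*sw*w);              % the ratio being maximized<br />
>> w = [1; 0];<br />
>> [J(w), J(3*w), J(-0.5*w)]                  % the three values coincide<br />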
<br />
We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math> where <math>\ \lambda </math> is the Lagrange multiplier<br />
<br />
<br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices; we assume it is positive definite, so that it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{W}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue <math>\ \lambda </math>. We take the largest eigenvalue because, under the constraint, <math>\underline{w}^T S_{B} \underline{w} = \lambda\ \underline{w}^T S_{W} \underline{w} = \lambda</math>, so the objective equals <math>\ \lambda </math>.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
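<br />
The closed form can be compared against the eigenvalue equation. A minimal sketch in Matlab, assuming the same means and covariance as in the R example above (all variable names are illustrative):<br />
<br />
>> mu1 = [1; 1]; mu2 = [5; 3];<br />
>> sw = 2*[1 1.5; 1.5 3];                % S_W = Sigma_1 + Sigma_2<br />
>> sb = (mu1 - mu2)*(mu1 - mu2)';        % S_B<br />
>> w = sw \ (mu1 - mu2);                 % S_W^{-1}(mu_1 - mu_2), computed without forming inv(sw)<br />
>> lambda = (mu1 - mu2)'*w;              % the scalar (mu_1 - mu_2)^T w<br />
>> norm(sw \ (sb*w) - lambda*w)          % residual of S_W^{-1} S_B w = lambda w, close to zero<br />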
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
This time, we will use Matlab to compare PCA and FDA.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
From the graph, it can be observed that the two classes overlap heavily when projected onto the PCA direction, whereas there is hardly any overlap when they are projected onto the FDA direction. FDA separates the two classes better than PCA in this example.<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
Since <code>sb</code> has rank one, only the first eigenvector corresponds to a non-zero eigenvalue, so we use it to project the data onto a line.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
The data are mapped onto a line, and the two classes are separated perfectly here.<br />
<br />
==== An extension of Fisher's discriminant analysis for stochastic processes ====<br />
<br />
<br />
A general notion of Fisher's linear discriminant analysis can extend the classical multivariate concept to situations that allow for function-valued random elements. The development uses a bijective mapping that connects a second order process to the reproducing kernel Hilbert space generated by its within class covariance kernel. This approach provides a seamless transition between Fisher's original development and infinite dimensional settings that lends itself well to computation via smoothing and regularization. <br />
<br />
Link for Algorithm introduction: [http://statgen.ncsu.edu/icsa2007/talks/HyejinShin.pdf]<br />
<br />
=== Multi-class Problem ===<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(When there are more than two classes, it is more reasonable to project onto at least 2 directions.)<br />
<br />
The within class covariance matrix <math>\mathbf{S}_{W}</math> can be easily obtained:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
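where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the scatter of class <math>i</math> about its own mean (written without a <math>1/n_{i}</math> factor so that it matches the sums used below) and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />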
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain directly. Here, we use the fact<br />
that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed: it does not<br />
depend on how the points are divided into classes. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
There is another way to generate <math>\mathbf{S}_{B}</math>. <br />
<br />
Denote a total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
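<br />
The cross terms drop out in the second equality because the deviations from a class mean sum to zero within that class. Written out, as a short supporting step:<br />
:<math><br />
\begin{align}<br />
& \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\ & = \sum_{i=1}^{k}\left[\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})\right](\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\ & = \sum_{i=1}^{k}(n_{i}\mathbf{\mu}_{i}-n_{i}\mathbf{\mu}_{i})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T} = \mathbf{0}<br />
\end{align}<br />
</math><br />
<br />
The other cross term is the transpose of this one, so it vanishes as well.<br />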
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
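<br />
This decomposition can be checked numerically. A minimal sketch in Matlab on made-up data, assuming three classes of 30 two-dimensional points each and the unnormalized sums used in the derivation above (all names and sizes are illustrative):<br />
<br />
>> X = randn(2,90);                              % one column per data point<br />
>> y = [ones(1,30), 2*ones(1,30), 3*ones(1,30)]; % class labels<br />
>> mu = mean(X,2);                               % total mean<br />
>> Xc = X - repmat(mu,1,90);<br />
>> St = Xc*Xc';                                  % total scatter S_T<br />
>> Sw = zeros(2); Sb = zeros(2);<br />
>> for i=1:3<br />
Xi = X(:, y==i);<br />
ni = size(Xi,2);<br />
mui = mean(Xi,2);<br />
Sw = Sw + (Xi - repmat(mui,1,ni))*(Xi - repmat(mui,1,ni))';<br />
Sb = Sb + ni*(mui - mu)*(mui - mu)';<br />
end<br />
>> norm(St - (Sw + Sb))                          % close to zero, up to round-off<br />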
<br />
Recall that in the two class FDA problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
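\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />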
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
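Comparing the two expressions, the two-class matrix <math>\mathbf{S}_{B^{\ast}}</math> and the general <math>\mathbf{S}_{B}</math> with <math>k=2</math> differ only by a scalar factor. Since <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, the general form gives <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B^{\ast}}</math>, so both lead to the same optimal direction.<br />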
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\quad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the <math>k-1</math> largest<br />
eigenvalues in the eigenvalue problem<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class FDA problem, we have:<br />
<br />
<math> max(Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]) </math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices. Setting the first derivative to zero and dividing by 2, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> can have at most <math>k-1</math> nonzero eigenvalues, because <math>\operatorname{rank}(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the <math>k-1</math> largest<br />
eigenvalues in the eigenvalue problem<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
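<br />
As a rough illustration of this eigenvalue problem, the sketch below finds only the leading eigenvector of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> by power iteration, in plain Python. The two <math>2\times 2</math> matrices are hypothetical numbers chosen for the example, and the helper names are not from the lecture. Obtaining all <math>k-1</math> columns of <math>\mathbf{W}</math> would need deflation or a full eigen-solver, which is not shown here.<br />
<br />
```python
import math

# Hypothetical within-class and between-class scatter matrices (d = 2).
S_W = [[4.0, 1.0], [1.0, 3.0]]
S_B = [[6.0, 2.0], [2.0, 1.0]]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    """Matrix product A B for nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def leading_eigenpair(M, iters=200):
    """Power iteration: returns (lambda_1, w_1) with ||w_1|| = 1."""
    w = [1.0] * len(M)
    for _ in range(iters):
        w = matvec(M, w)
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # For a unit eigenvector w, w^T M w equals the eigenvalue.
    lam = sum(wi * mi for wi, mi in zip(w, matvec(M, w)))
    return lam, w

M = matmul(inv2(S_W), S_B)        # S_W^{-1} S_B
lam1, w1 = leading_eigenpair(M)   # leading eigenvalue and discriminant direction
```
<br />
The pair <code>(lam1, w1)</code> should also satisfy the generalized form <math>\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{S}_{W}\mathbf{w}</math> up to rounding.<br />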
<br />
===Generalization of Fisher's Linear Discriminant ===<br />
<br />
{{Cleanup|date= 2 October 2010|reason= I am not sure how to interpret the last sentence in this paragraph}}<br />
<br />
Fisher's linear discriminant (Fisher, 1936) is a very popular technique among users of discriminant analysis, partly because it is simple and does not rely on strict assumptions.<br />
However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the resulting discriminant rule can be strongly affected by only a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient or appropriate way of handling the situation. There is therefore a need for robust procedures that can accommodate outliers. To this end, a generalization of Fisher's linear discriminant algorithm [http://www.math.ist.utl.pt/~apires/PDFs/APJB_RP96.pdf] has been developed that leads to a very robust procedure.<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique that is often used as a preprocessing step for classification. It has applications in data visualization, data mining, and reducing the dimensionality of a data set. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>c > 0</math>, so without loss of generality we assume<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
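<br />
The result of this example can be cross-checked by brute force. The sketch below, in plain Python, parametrizes the constraint circle as <math>(\cos t, \sin t)</math> and scans a fine grid of angles; the grid size <code>N</code> and the variable names are arbitrary choices made for illustration.<br />
<br />
```python
import math

def f(x, y):
    """Objective function of the example."""
    return x - y

# Every point on the constraint x^2 + y^2 = 1 can be written as (cos t, sin t).
N = 100000
best_t = max((2 * math.pi * i / N for i in range(N)),
             key=lambda t: f(math.cos(t), math.sin(t)))
x_best, y_best = math.cos(best_t), math.sin(best_t)

# At the maximum the gradients are parallel: (1, -1) = lam * (2x, 2y).
lam = 1 / (2 * x_best)            # from the first component
residual = -1 - lam * 2 * y_best  # second component, should be close to 0
```
<br />
The grid search should land close to <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, and <code>residual</code> should be close to zero, in line with the condition <math>\displaystyle \nabla f = \lambda \nabla g</math>.<br />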
<br />
====Determining '''W''' ====<br />
Back to the original problem, we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is nonzero. Setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = 0</math> recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so the maximization takes place over unit vectors.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
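<br />
The relation between the first principal component and the largest eigenvalue can be tried out on a small example. The following is a minimal sketch in plain Python that uses power iteration on the sample covariance matrix; the ten data points are made up for illustration, and the function names are hypothetical.<br />
<br />
```python
import math

# Hypothetical 2-D sample, made up for illustration.
X = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
     (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

def covariance(X):
    """Return the sample mean and the sample covariance matrix S."""
    n, d = len(X), len(X[0])
    mu = [sum(x[j] for x in X) / n for j in range(d)]
    S = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in X) / (n - 1)
          for b in range(d)] for a in range(d)]
    return mu, S

def first_pc(S, iters=500):
    """Power iteration for the unit eigenvector of S with the largest eigenvalue."""
    d = len(S)
    w = [1.0] * d
    for _ in range(iters):
        w = [sum(S[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    lam = sum(w[a] * S[a][b] * w[b] for a in range(d) for b in range(d))  # w^T S w
    return lam, w

mu, S = covariance(X)
lam, w = first_pc(S)

# Project the centred data onto w and measure the variance of the projections.
u = [sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mu)) for x in X]
var_u = sum(v * v for v in u) / (len(u) - 1)
```
<br />
Here <code>var_u</code> should agree with <code>lam</code> up to rounding, which is the statement <math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \lambda^*</math> in numerical form.<br />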
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
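<br />
A small numerical check of this decomposition, as a sketch in plain Python: the <math>2\times 2</math> covariance matrix below is a hypothetical example, and the closed-form eigenvalue formula applies only to symmetric <math>2\times 2</math> matrices.<br />
<br />
```python
import math

# Hypothetical 2x2 sample covariance matrix (symmetric, positive definite).
S = [[4.0, 1.5], [1.5, 1.0]]

def eigenvalues_sym2(S):
    """Eigenvalues of a symmetric 2x2 matrix, largest first."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    half_trace = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return half_trace + radius, half_trace - radius

lam1, lam2 = eigenvalues_sym2(S)
total_variance = S[0][0] + S[1][1]          # Tr(S) = Var(x_1) + Var(x_2)
explained_by_first = lam1 / (lam1 + lam2)   # share of variance along the first PC
```
<br />
The ratio <code>explained_by_first</code> is one common way to judge how much of the total variance a single component retains.<br />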
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
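 load noisy                                % load the data matrix X (one image per column)<br />
 who                                       % list the variables in the workspace<br />
 size(X)                                   % rows = pixels per image, columns = number of images<br />
 imagesc(reshape(X(:,1),20,28)')           % display the first noisy image<br />
 colormap gray<br />
 imagesc(reshape(X(:,1),20,28)')<br />
 [u s v] = svd(X);                         % singular value decomposition of X<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />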
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
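<br />
The same keep-only-the-top-components idea can be seen on a much smaller scale. The sketch below, in plain Python, reduces a made-up 2-D data set to a single principal component and reconstructs it; the data values and variable names are hypothetical, and the closed-form eigenvector used here works only for a symmetric <math>2\times 2</math> covariance matrix.<br />
<br />
```python
import math

# Hypothetical noisy points scattered around the line y = x; made up for illustration.
X = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.3), (4.0, 3.8), (5.0, 5.1), (6.0, 5.7)]
n = len(X)

# Centre the data (PCA assumes zero mean).
mu = (sum(x for x, _ in X) / n, sum(y for _, y in X) / n)
Xc = [(x - mu[0], y - mu[1]) for x, y in X]

# Entries of the sample covariance matrix [[a, b], [b, d]].
a = sum(x * x for x, _ in Xc) / (n - 1)
b = sum(x * y for x, y in Xc) / (n - 1)
d = sum(y * y for _, y in Xc) / (n - 1)

# Leading unit eigenvector of a symmetric 2x2 matrix via its principal-axis angle.
theta = 0.5 * math.atan2(2 * b, a - d)
u1 = (math.cos(theta), math.sin(theta))

# Encode: one coefficient per point. Decode: map back to 2-D and add the mean.
Y = [u1[0] * x + u1[1] * y for x, y in Xc]
X_hat = [(mu[0] + c * u1[0], mu[1] + c * u1[1]) for c in Y]

# Reconstruction error (scaled by n - 1) and the variance along the discarded direction.
recon_var = sum((x - xh) ** 2 + (y - yh) ** 2
                for (x, y), (xh, yh) in zip(X, X_hat)) / (n - 1)
lam2 = (a + d) / 2 - math.sqrt(((a - d) / 2) ** 2 + b * b)
```
<br />
The reconstruction error <code>recon_var</code> should match the smaller eigenvalue <code>lam2</code>, i.e. exactly the variance that was thrown away with the second component.<br />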
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this. We can classify the data (i.e. give each data point a label and compare the different types of data) or cluster the data (i.e. leave the data unlabeled and compare the resulting groups).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 × 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to work with the reduced data and its features. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6434stat841f102010-10-02T22:20:53Z<p>Hclam: /* Two-class Problem */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples, who needed to recognize which wild animals were beneficial to people and which ones were harmful. The earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.or/wiki/Regression_analysis clustering], and [http://en.wikipedia.org/wiki/Regression_analysis dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector, and the label of the <math>\,i</math>th training input, <math>\,Y_{i} \in \mathcal{Y} </math>, takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
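<br />
The empirical error rate translates directly into code. The following is a minimal sketch in plain Python; the threshold rule <code>h</code>, the diameters, and the labels are all hypothetical values invented for the example.<br />
<br />
```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(X) != len(Y) or not X:
        raise ValueError("X and Y must be non-empty and of equal length")
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)

# Hypothetical rule: call a fruit an "orange" when its diameter exceeds 7.5 cm.
def h(x):
    return "orange" if x > 7.5 else "apple"

# Hypothetical training set: diameters in cm and their true labels.
X_train = [6.8, 7.0, 7.9, 8.1, 7.2, 7.7]
Y_train = ["apple", "apple", "orange", "orange", "orange", "apple"]

error = empirical_error_rate(h, X_train, Y_train)
```
<br />
In this toy set the rule mislabels the 7.2 cm orange and the 7.7 cm apple, so two of the six training inputs are counted by the indicator.<br />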
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it produces the least possible probability of misclassification for any given new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*(x)) \le L(h(x))</math>. Here, <math>\,L</math> represents the true error rate, <math>\,h^*</math> is the Bayes classifier's classification rule, and <math>\,x</math> is any given data input's feature values. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the components that make up the Bayes classifier's model, such as the likelihoods and the prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate considerably from their true population values, and this can ultimately cause the estimated posterior probabilities to deviate considerably from their true values as well. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
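<br />
The theorem can be illustrated with a short sketch: compute the posterior of every class with the Bayes formula and return the class with the largest posterior. The likelihood values <code>f</code> and the priors <code>pi</code> below are hypothetical numbers chosen for illustration and do not come from the course data.<br />

```python
def posteriors(likelihoods, priors):
    """Return P(Y=i | X=x) for every class i using the Bayes formula.

    likelihoods[i] plays the role of f_i(x) and priors[i] of pi_i.
    """
    joint = {i: likelihoods[i] * priors[i] for i in priors}
    evidence = sum(joint.values())  # P(X = x)
    if evidence == 0:
        raise ValueError("P(X=x) is zero, so the posteriors are undefined")
    return {i: joint[i] / evidence for i in joint}


def bayes_classify(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Hypothetical values for three classes at one fixed input x.
f = {1: 0.10, 2: 0.40, 3: 0.20}
pi = {1: 0.5, 2: 0.2, 3: 0.3}

post = posteriors(f, pi)
label = bayes_classify(f, pi)
```

Since the evidence is the same for every class, comparing the products <math>\,f_i(x)\pi_i</math> alone would select the same class.<br />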
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his/her feature values are <math>\, x = \{G, M, H\} </math> and his/her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = \{0, 1, 0\}</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
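<br />
The arithmetic of this example can be checked with a few lines of code. The sketch below assumes the likelihood values 0.05 (for ''pass'') and 0.2 (for ''fail'') that appear in the calculation above, together with the equal priors of 0.5; the variable names are illustrative only.<br />

```python
# Values taken from the calculation above for x = (0, 1, 0).
lik_pass = 0.05   # P(X = (0,1,0) | Y = 1)
lik_fail = 0.20   # P(X = (0,1,0) | Y = 0)
prior_pass = 0.5  # P(Y = 1)
prior_fail = 0.5  # P(Y = 0)

numerator = lik_pass * prior_pass
evidence = lik_pass * prior_pass + lik_fail * prior_fail
r_hat = numerator / evidence  # estimated P(Y = 1 | X = x)

# Classification rule: predict 1 (pass) only if r_hat exceeds 1/2.
prediction = 1 if r_hat > 0.5 else 0
```

With these values the numerator is 0.025 and the evidence is 0.125, so <math>\,\hat r(x) = 0.2</math> and the predicted class is 0.<br />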
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable it is that event E would occur, prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate the more useful information is available regarding events that are relevant to event E. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work that laid the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number of, well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
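<br />
The frequentist approximation <math>P(x) \approx \frac{n_x}{n_t}</math> can be illustrated by simulation. The following is only a sketch: the "intrinsic" probability of 0.3, the trial counts, and the seed are arbitrary assumptions, and a pseudo-random generator stands in for real independent trials.<br />

```python
import random


def relative_frequency(p_true, n_trials, seed=0):
    """Simulate n_trials independent trials and return n_x / n_t."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_x / n_trials


# Relative frequencies for an increasing number of trials.
estimates = {n_t: relative_frequency(0.3, n_t) for n_t in (10, 1000, 100000)}
```

As the number of trials grows, the relative frequency is expected to settle near the assumed intrinsic probability.<br />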
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior probabilities and the class-conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the number of features. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
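<br />
The coefficients of this boundary can be computed directly from the class means, the shared covariance matrix, and the priors. The sketch below assumes NumPy is available; the function name <code>lda_boundary</code> and the two-dimensional toy parameters are hypothetical.<br />

```python
import numpy as np


def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the LDA boundary is a @ x + b = 0."""
    sigma_inv = np.linalg.inv(sigma)
    a = -2.0 * sigma_inv @ (mu1 - mu0)
    b = (-2.0 * np.log(pi1 / pi0)
         + mu1 @ sigma_inv @ mu1
         - mu0 @ sigma_inv @ mu0)
    return a, b


# Hypothetical parameters: identity covariance and equal priors.
mu1 = np.array([2.0, 0.0])
mu0 = np.array([0.0, 0.0])
sigma = np.eye(2)

a, b = lda_boundary(mu1, mu0, sigma, 0.5, 0.5)

# With equal priors the midpoint of the two means lies on the boundary.
midpoint = (mu1 + mu0) / 2.0
value_at_midpoint = a @ midpoint + b

# Because the derivation multiplied by -2, negative values favour class 1.
value_at_mu1 = a @ mu1 + b
```

Note that the sign convention follows the derivation above: a point with a negative value of the boundary expression has the larger posterior for class 1.<br />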
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA in the two-class case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with this derivation, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. If the two classes <math>\,m </math> and <math>\,n</math> have equal prior probabilities (for example, when the priors are estimated from a training set in which both classes have the same number of data points), then the term <math>\,\log(\frac{\pi_m}{\pi_n})</math> vanishes and the resulting decision boundary passes through the midpoint between the class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier's rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariances of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
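<br />
The two discriminant functions in the theorem can be sketched as follows, assuming NumPy is available. The function names and the toy parameters are hypothetical, and classes are identified by their 0-based position in the parameter lists rather than by the labels <math>\,1,\dots,k</math>.<br />

```python
import numpy as np


def delta_qda(x, mu, sigma, prior):
    """Quadratic discriminant: class-specific covariance sigma."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)  # log|sigma|, numerically stable
    return (-0.5 * logdet
            - 0.5 * diff @ np.linalg.solve(sigma, diff)
            + np.log(prior))


def delta_lda(x, mu, sigma, prior):
    """Linear discriminant: covariance sigma shared by all classes."""
    sigma_inv_mu = np.linalg.solve(sigma, mu)
    return x @ sigma_inv_mu - 0.5 * mu @ sigma_inv_mu + np.log(prior)


def classify(x, mus, sigmas, priors, delta):
    """Return the position of the class with the largest discriminant."""
    scores = [delta(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))


# Hypothetical two-class problem with a shared covariance matrix.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
shared = np.array([[1.0, 0.2], [0.2, 1.0]])
sigmas = [shared, shared]
priors = [0.5, 0.5]
x_new = np.array([2.5, 2.0])

label_qda = classify(x_new, mus, sigmas, priors, delta_qda)
label_lda = classify(x_new, mus, sigmas, priors, delta_lda)
```

When all classes share one covariance matrix, the two discriminants differ only by terms that do not depend on the class, so both rules select the same class.<br />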
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we assume a common covariance matrix, it is estimated by the average of the class sample covariances, weighted by class size:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma_r}}{\sum_{l=1}^{k}n_l} </math><br />
<br />
This is the maximum likelihood estimate of <math>\,\Sigma</math>.<br />
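<br />
The estimates in this section can be computed with a few lines of NumPy. This is a minimal sketch rather than a full implementation; the function name <code>estimate_parameters</code> and the toy data set are hypothetical.<br />

```python
import numpy as np


def estimate_parameters(X, y):
    """Return priors, means, class covariances and the pooled covariance.

    X has shape (n, d) and y holds the class label of each row.
    """
    n = len(y)
    classes = np.unique(y).tolist()
    priors, means, covs = {}, {}, {}
    for k in classes:
        X_k = X[y == k]
        n_k = len(X_k)
        priors[k] = n_k / n                    # n_k / n
        means[k] = X_k.mean(axis=0)            # class sample mean
        centered = X_k - means[k]
        covs[k] = centered.T @ centered / n_k  # divides by n_k, as above
    # Pooled covariance: sum_r n_r * Sigma_r / n, i.e. prior-weighted average.
    pooled = sum(priors[k] * covs[k] for k in classes)
    return priors, means, covs, pooled


# Hypothetical data: four points in class 0 and two points in class 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],
              [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

priors, means, covs, pooled = estimate_parameters(X, y)
```

The covariance estimates divide by <math>\,n_k</math> rather than <math>\,n_k - 1</math>, which matches the formulas in this section.<br />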
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix, <math>\ |I|=1 </math>, and <math>\,\log(1)=0</math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. When the priors are equal, the class whose center is closest to the point maximises <math>\,\delta_k</math>; otherwise, the log of the prior can shift the decision towards a more probable class. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, as a covariance matrix <math>\, \Sigma_k </math> is, we can take <math>\, U=V</math>.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
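<br />
The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be checked numerically. The sketch below assumes NumPy and uses an eigendecomposition of the symmetric covariance matrix in place of a general SVD (for a symmetric positive-definite matrix both yield a valid <math>\,U</math> and <math>\,S</math>); the covariance matrix and the points are hypothetical.<br />

```python
import numpy as np


def whitening_matrix(sigma):
    """Return W = S^(-1/2) U^T for a symmetric positive-definite sigma."""
    eigvals, U = np.linalg.eigh(sigma)  # sigma = U diag(eigvals) U^T
    return np.diag(eigvals ** -0.5) @ U.T


# Hypothetical covariance matrix, class mean and data point.
sigma = np.array([[4.0, 1.0], [1.0, 3.0]])
mu = np.array([1.0, 2.0])
x = np.array([3.0, 0.5])

W = whitening_matrix(sigma)
x_star, mu_star = W @ x, W @ mu

# The quadratic form in the original space ...
mahalanobis_sq = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)
# ... equals the squared Euclidean distance after the transformation.
euclidean_sq = np.sum((x_star - mu_star) ** 2)
```

After the transformation the covariance of the data becomes the identity matrix, which is why the problem reduces to Case 1.<br />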
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we want to transform them to the same shape. Given a new data point, we must decide which class it belongs to, but which transformation should we apply to it? If we use the transformation of class A, for example, then we have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not be the case in practice. Another method, kernel QDA, does not rely on the Gaussian assumption and may work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class against each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
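<br />
As a quick check of the two formulas above, the counts can be evaluated for particular values. The values of <code>K</code> and <code>d</code> below are assumptions chosen only for illustration.<br />
<br />
 >> K = 3; d = 10;                        % hypothetical number of classes and dimensions<br />
 >> n_lda = (K-1)*(d+1)                   % parameters for the K-1 linear boundaries<br />
 n_lda =<br />
     22<br />
 >> n_qda = (K-1)*(d*(d+3)/2+1)           % parameters for the K-1 quadratic boundaries<br />
 n_qda =<br />
    132<br />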
<br />
== Trick: Using LDA to do QDA ==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to produce a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course. Note that this is not the same as running QDA: applying QDA directly to the same problem generally gives a different decision boundary. What we obtain here is a nonlinear boundary found by LDA in an augmented feature space, so the method could also be called nonlinear LDA.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of the original features. We then do LDA on the new higher-dimensional data. The linear boundary found by LDA in the higher-dimensional space corresponds to a quadratic boundary in the original space.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra estimates make QDA less robust when there are few data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
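where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is a <math>\,d \times d</math> matrix, that we cannot estimate directly with a linear method. For simplicity, the construction below keeps only the diagonal entries <math>\,v_1,\dots,v_d</math> of <math>\,v</math>, i.e. the squared terms without cross terms.<br />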
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the element-wise squares of the original matrix, to a cubic dimension with the element-wise cubes, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
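<br />
In Matlab the augmentation can be written in one line. The following is a minimal sketch on a tiny made-up matrix (the values are hypothetical), with rows as observations as in the example in the next section.<br />
<br />
 >> X = [1 2; 3 4; 5 6];         % hypothetical sample: 3 observations, 2 features<br />
 >> X_star = [X, X.^2]           % append the element-wise squares as two new columns<br />
 X_star =<br />
      1     2     1     4<br />
      3     4     9    16<br />
      5     6    25    36<br />
<br />
Replacing <code>X.^2</code> with <code>X.^3</code> or <code>sin(X)</code> gives the cubic or sine variants mentioned above.<br />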
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
 >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
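<br />
:The same rate can be computed directly rather than by hand. This is a small sketch that uses only the <code>class</code> and <code>group</code> variables defined above; the name <code>err</code> is introduced here for illustration.<br />
<br />
 >> err = sum(class ~= group) / numel(group)    % fraction of misclassified points<br />
 err =<br />
     0.0775<br />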
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in Matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some emphasized steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in SCORES,<br />
 % the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows and columns of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, <code>princomp</code> centers <math>\,X</math> by subtracting off the column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using the SVD method and <code>princomp</code> respectively, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (up to a possible sign flip of each column).<br />
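<br />
One way to confirm this numerically is to compare absolute values, since each principal direction is only determined up to sign. The sketch below assumes the variables <code>y</code>, <code>v</code>, <code>score</code> and <code>U</code> from the two snippets above. It checks only the first two components (an arbitrary choice), because directions belonging to zero or repeated singular values are not unique.<br />
<br />
 >> k = 2;                                               % number of leading components to compare<br />
 >> max(max(abs(abs(y(:,1:k)) - abs(score(:,1:k)))))     % close to zero if the projections agree up to sign<br />
 >> max(max(abs(abs(v(:,1:k)) - abs(U(:,1:k)))))         % close to zero if the directions agree up to sign<br />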
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points within each class close to each other while keeping them far from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
 >> library(MASS)  # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
: Calculate the singular value decomposition of the centered data (PCA is defined on data with the column means removed). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA).<br />
 >> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper, "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without the computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, with each class possibly having a different size. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, i.e. <math> max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math><br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \dots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
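<br />
The two facts used in the goals above, namely that projecting with <math>\underline{w}</math> maps a class mean to <math>\underline{w}^T\underline{\mu}</math> and a class covariance to <math>\underline{w}^T\Sigma\underline{w}</math>, can be checked on simulated data. This is only a sketch: the direction <code>w</code> is an arbitrary assumption, and the mean and covariance are illustrative values.<br />
<br />
 >> X1 = mvnrnd([1,1],[1 1.5;1.5 3],300);   % simulated class, rows are observations<br />
 >> w = [1; -1];                            % hypothetical projection direction<br />
 >> z = X1*w;                               % projected points z_i = w'*x_i<br />
 >> [mean(z), mean(X1)*w]                   % the two entries agree: projected mean is w'*mu<br />
 >> [var(z), w'*cov(X1)*w]                  % the two entries agree: projected variance is w'*Sigma*w<br />
<br />
Both identities hold exactly for the sample mean and sample covariance, so the paired values should match up to rounding error.<br />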
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is scalar<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>max (\underline{w}^T S_{B} \underline{w})</math><br />
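<br />
As a small worked example, assume the illustrative means <math>\underline{\mu_{1}} = (1, 1)^T</math> and <math>\underline{\mu_{2}} = (5, 3)^T</math>. Then <math>\underline{\mu_{1}}-\underline{\mu_{2}} = (-4, -2)^T</math> and<br />
:<math>S_{B} = \left( \begin{array}{c} -4 \\ -2 \end{array} \right) \left( \begin{array}{cc} -4 & -2 \end{array} \right) = \left( \begin{array}{cc} 16 & 8 \\ 8 & 4 \end{array} \right)</math><br />
For the hypothetical direction <math>\underline{w} = (1, 0)^T</math> we get <math>\underline{w}^T S_{B} \underline{w} = 16</math>, which matches the squared distance between the projected means, <math>(1 - 5)^2 = 16</math>.<br />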
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math><br />
<br />
Thus, the covariance of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we get<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
Our goal: <math>min(\underline{w}^T S_{W} \underline{w})</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
Because this ratio does not change when <math>\underline{w}</math> is multiplied by a nonzero scalar, the maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to the constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>.<br />
<br />
The constraint is needed because, taken on its own, <math>\ \underline w^T S_B \underline w</math> has no upper bound: scaling <math>\underline{w}</math> by a large constant makes it arbitrarily large. Likewise, <math>\ \underline w^T S_W \underline w</math> can be made arbitrarily close to zero by shrinking <math>\underline{w}</math>.<br />
<br />
We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math> where <math>\ \lambda </math> is the Lagrange multiplier<br />
<br />
<br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive definite matrices (assuming each class covariance has full rank) and so it has an inverse.<br />
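<br />
The first implication above relies on a standard identity for the gradient of a quadratic form, stated here for a generic symmetric matrix <math>\,A</math> (the symbol <math>\,A</math> is introduced only for this remark; both <math>\,S_{B}</math> and <math>\,S_{W}</math> are symmetric):<br />
:<math>\frac{\part}{\part \underline{w}}\left(\underline{w}^T A \underline{w}\right) = (A + A^T)\underline{w} = 2A\underline{w}</math><br />
Applying it to each term of <math>L(\underline{w},\lambda)</math> gives <math>\frac{\part L}{\part \underline{w}} = 2\ S_{B}\ \underline{w} - 2\lambda\ S_{W}\ \underline{w}</math>.<br />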
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{W}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue <math>\ \lambda </math>.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>. Since only the direction of <math>\underline{w}</math> matters, no eigendecomposition is needed in the two-class case.<br />
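<br />
The closed form can be checked numerically. The following sketch uses assumed values (the same means and common covariance as the simulated examples on this page) and verifies that <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> satisfies the eigenvalue equation.<br />
<br />
 >> mu1 = [1; 1]; mu2 = [5; 3];           % illustrative class means<br />
 >> sw = 2*[1 1.5; 1.5 3];                % S_W = Sigma_1 + Sigma_2 with equal covariances<br />
 >> sb = (mu1 - mu2)*(mu1 - mu2)';        % S_B<br />
 >> w = sw \ (mu1 - mu2);                 % closed form: inv(S_W)*(mu1 - mu2)<br />
 >> lambda = (mu1 - mu2)'*w               % the corresponding eigenvalue<br />
 lambda =<br />
    18.6667<br />
 >> norm(sw \ (sb*w) - lambda*w)          % residual of inv(S_W)*S_B*w = lambda*w, close to zero<br />
<br />
The backslash operator solves the linear system instead of forming <code>inv(sw)</code> explicitly, which is usually the more stable choice.<br />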
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
This time, we will use Matlab to compare PCA and FDA.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
From the graph, it can be observed that there is a huge overlap between the two classes when they are projected onto the PCA direction, whereas there is hardly any overlap when they are projected onto the FDA direction. FDA separates the two classes better than PCA in this example.<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
Since <math>\,S_{B}</math> has rank one in the two-class case, only the leading eigenvector is informative, so we use it to project the data onto a line.<br />
<br />
 >>[v d] = eigs( inv(sw) * sb );<br />
 >>w = v(:, 1);<br />
 >>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
The data are mapped onto a line, and the two classes are separated perfectly here.<br />
<br />
==== An extension of Fisher's discriminant analysis for stochastic processes ====<br />
<br />
<br />
A general notion of Fisher's linear discriminant analysis can extend the classical multivariate concept to situations that allow for function-valued random elements. The development uses a bijective mapping that connects a second order process to the reproducing kernel Hilbert space generated by its within class covariance kernel. This approach provides a seamless transition between Fisher's original development and infinite dimensional settings that lends itself well to computation via smoothing and regularization. <br />
<br />
Link to an introduction to the algorithm: [http://statgen.ncsu.edu/icsa2007/talks/HyejinShin.pdf]<br />
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set, and it is often used as a preprocessing step for classification. It has applications in data visualization, data mining, dimensionality reduction, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in the data<br />
* Preserving the information stored in the data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
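<br />
As a rough illustration of this idea, the following Matlab sketch generates synthetic 3-dimensional data and reports how much of the total variance is kept by the first ''k'' = 2 principal components. The data, the mixing matrix A, and the variable names are assumptions made for illustration; they are not part of the course data sets.<br />
<br />
 rng(0);                                  % fix the seed so the run is repeatable<br />
 n = 500;<br />
 A = [3 0 0; 1 1 0; 0 0 0.1];             % hypothetical mixing matrix<br />
 X = A * randn(3, n);                     % 3-by-n data, one column per sample<br />
 S = cov(X');                             % 3-by-3 sample covariance matrix<br />
 [V, D] = eig(S);                         % columns of V are eigenvectors of S<br />
 [lambda, idx] = sort(diag(D), 'descend');<br />
 V = V(:, idx);                           % principal components, ordered by variance<br />
 k = 2;<br />
 kept = sum(lambda(1:k)) / sum(lambda)    % fraction of the variance kept by k components<br />
<br />
Because the third direction of this synthetic data carries very little variance, the fraction kept should be close to 1.<br />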
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
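<br />
A minimal Matlab sketch of this factorization is given below. It assumes that the images are stored as the columns of a 644-by-103 matrix X; since the data file is not specified in these notes, X is filled with random numbers as a stand-in, and the names xbar, Xc and Xhat are hypothetical.<br />
<br />
 X = rand(644, 103);                          % placeholder for the image matrix<br />
 xbar = mean(X, 2);                           % mean image (the constant term)<br />
 Xc = X - repmat(xbar, 1, size(X, 2));        % subtract the mean image from every column<br />
 [U, S, Q] = svd(Xc, 'econ');<br />
 V = U(:, 1:2);                               % 644-by-2, columns are v_1 and v_2<br />
 W = V' * Xc;                                 % 2-by-103, column j holds lambda for image j<br />
 Xhat = repmat(xbar, 1, size(X, 2)) + V * W;  % approximation of every image<br />
 size(V), size(W)<br />
<br />
Each column of Xhat is the mean image plus a linear combination of the two columns of V, which matches the formula for <math>\hat{f}(\lambda)</math>.<br />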
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>c > 0</math>, so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables. Our goal is then to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle \textbf{w} </math>. The first principal component is the vector that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
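<br />
The bound can be checked numerically. The Matlab sketch below uses made-up 5-dimensional data and tries many random unit vectors; norm(S) is the matrix 2-norm, which for a covariance matrix equals its largest eigenvalue.<br />
<br />
 rng(1);                                  % fix the seed so the run is repeatable<br />
 X = randn(5, 200);                       % hypothetical 5-dimensional sample<br />
 S = cov(X');<br />
 best = 0;<br />
 for t = 1:1000<br />
     w = randn(5, 1);<br />
     w = w / norm(w);                     % random unit vector<br />
     best = max(best, w' * S * w);        % largest projected variance seen so far<br />
 end<br />
 [best, norm(S)]                          % best does not exceed norm(S)<br />
<br />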
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is a maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
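<br />
This answer can be checked without Lagrange multipliers by a brute-force search over the constraint set. The Matlab sketch below parametrizes the unit circle by an angle theta (the grid size is an arbitrary choice) and picks the point with the largest value of x - y.<br />
<br />
 theta = linspace(0, 2*pi, 100001);       % grid of angles around the unit circle<br />
 x = cos(theta);  y = sin(theta);         % every (x, y) satisfies x^2 + y^2 = 1<br />
 [fmax, i] = max(x - y);<br />
 [x(i), y(i), fmax]                       % close to [0.7071, -0.7071, 1.4142]<br />
<br />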
<br />
====Determining '''W''' ====<br />
Returning to the original problem, we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, then the second term is nonzero. Setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = 0</math> recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of <math>\displaystyle L(\textbf{w},\lambda)</math> satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
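<br />
The relationship between the largest eigenvalue and the variance of the projection can be verified numerically. The following Matlab sketch uses made-up 2-dimensional data; all names are hypothetical.<br />
<br />
 rng(2);                                  % fix the seed so the run is repeatable<br />
 X = [2 0; 1 0.5] * randn(2, 1000);       % hypothetical correlated 2-dimensional data<br />
 S = cov(X');<br />
 [V, D] = eig(S);<br />
 [lambda1, i] = max(diag(D));             % largest eigenvalue of S<br />
 w = V(:, i);                             % corresponding unit-length eigenvector<br />
 u = w' * X;                              % projections of the data onto w<br />
 [w' * S * w, lambda1, var(u)]            % the three values agree<br />
<br />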
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
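<br />
A quick numerical check of this identity, as a Matlab sketch on made-up correlated data:<br />
<br />
 rng(3);                                        % fix the seed so the run is repeatable<br />
 X = randn(4, 300);                             % hypothetical 4-dimensional sample<br />
 X(2, :) = X(2, :) + 2 * X(1, :);               % make two of the variables correlated<br />
 S = cov(X');<br />
 [sum(eig(S)), trace(S), sum(var(X, 0, 2))]     % all three are equal<br />
<br />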
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                              % loads the data matrix X, one image per column<br />
 who<br />
 size(X)<br />
 imagesc(reshape(X(:,1),20,28)')         % display the first noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the denoised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
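<br />
One common heuristic, which is not taken from the lecture, is to keep the smallest number of components that explains a chosen fraction of the total variance. The Matlab sketch below applies it to made-up data with three strong directions plus noise; the 95% threshold and all names are assumptions.<br />
<br />
 rng(4);                                                    % fix the seed so the run is repeatable<br />
 X = randn(20, 3) * randn(3, 500) + 0.1 * randn(20, 500);   % hypothetical data<br />
 Xc = X - repmat(mean(X, 2), 1, size(X, 2));                % centre the data<br />
 sv = svd(Xc);                                              % singular values, largest first<br />
 explained = cumsum(sv.^2) / sum(sv.^2);                    % cumulative fraction of variance<br />
 d = find(explained >= 0.95, 1)                             % smallest d keeping at least 95%<br />
<br />
The threshold itself is a judgment call, so this only moves the difficulty from choosing the number of components to choosing the fraction.<br />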
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. leave the data unlabeled and compare the resulting groups).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
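<br />
The slide is not reproduced in text form here, so the following Matlab sketch shows the usual SVD-based procedure rather than a transcription of the slide. The synthetic training data and the choices D = 10, n = 200 and d = 2 are assumptions for illustration.<br />
<br />
 rng(5);                                                % fix the seed so the run is repeatable<br />
 D = 10;  n = 200;  d = 2;<br />
 X = randn(D, d) * randn(d, n) + 0.05 * randn(D, n);    % hypothetical D-by-n training data<br />
 mu = mean(X, 2);<br />
 Xc = X - repmat(mu, 1, n);                             % centre the data<br />
 [U, S, V] = svd(Xc, 'econ');<br />
 U = U(:, 1:d);                                         % top d principal directions<br />
 Y = U' * Xc;                                           % encode: d-by-n<br />
 Xhat = U * Y + repmat(mu, 1, n);                       % reconstruct: D-by-n<br />
 relErr = norm(X - Xhat, 'fro') / norm(X, 'fro')        % relative reconstruction error<br />
<br />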
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principal_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6433stat841f102010-10-02T22:20:05Z<p>Hclam: /* Two-class Problem */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views. Some people call supervised learning "classification" and unsupervised learning "clustering", but others call clustering "unsupervised classification", as in http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other sources that can be found by a Google search. I think it would be good to clarify this here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> Y </math> from another random variable <math> X </math>, where <math> Y </math> represents the label assigned to a new data input and <math> X </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the <math>\,i</math>th training input's feature values <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math> form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to explain why we define the empirical error rate and use it instead of the true error rate. The main reason is that in experimental tasks we cannot measure the true error rate, so we estimate it by the empirical error rate, which is used as an estimate of the true error rate. }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,i</math>th training input, respectively.<br />
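<br />
As a small worked example, the Matlab sketch below computes the empirical error rate of a threshold rule on six made-up training inputs. The data, the threshold 3.5, and the names h and Lhat are hypothetical.<br />
<br />
 X = [1.0 2.0 3.0 4.0 5.0 6.0];          % hypothetical one-dimensional feature values<br />
 Y = [0   0   1   0   1   1  ];          % hypothetical true labels<br />
 h = @(x) double(x > 3.5);               % hypothetical classification rule<br />
 Lhat = mean(h(X) ~= Y)                  % 2 of the 6 inputs are misclassified, so Lhat = 0.3333<br />
<br />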
<br />
=== Bayes Classifier ===<br />
<br />
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
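<br />
A small numerical example of this formula, written as a Matlab sketch with two classes. The prior and likelihood values are made up for illustration.<br />
<br />
 prior = [0.3 0.7];                      % hypothetical P(Y=1) and P(Y=2)<br />
 lik   = [0.5 0.1];                      % hypothetical P(X=x|Y=1) and P(X=x|Y=2)<br />
 evidence  = sum(lik .* prior);          % P(X=x) = 0.15 + 0.07 = 0.22<br />
 posterior = lik .* prior / evidence     % approximately [0.6818 0.3182]<br />
<br />
Although class 2 has the larger prior, the likelihood favours class 1 strongly enough that class 1 has the larger posterior probability.<br />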
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P(Y=1|X=x)</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the vector of feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
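<br />
To make the two-class rule concrete, the Matlab sketch below assumes that the class-conditional densities and the prior are known exactly: X given Y = 0 is N(0,1), X given Y = 1 is N(2,1), and P(Y = 1) = 0.5. These distributions are chosen only for illustration; with them the decision boundary is the single point x = 1.<br />
<br />
 f0 = @(x) exp(-(x - 0).^2 / 2) / sqrt(2*pi);               % assumed density of X given Y = 0<br />
 f1 = @(x) exp(-(x - 2).^2 / 2) / sqrt(2*pi);               % assumed density of X given Y = 1<br />
 p1 = 0.5;                                                  % assumed prior P(Y = 1)<br />
 r  = @(x) f1(x) * p1 ./ (f1(x) * p1 + f0(x) * (1 - p1));   % posterior P(Y = 1 | X = x)<br />
 h  = @(x) double(r(x) > 0.5);                              % classification rule<br />
 h([-1 0.5 1.5 3])                                          % returns [0 0 1 1]<br />
<br />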
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it produces the least possible probability of misclassification for any given new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It should be noted why other classifiers, such as neural networks, are needed if the Bayes classifier is optimal. The main reason is that, while this method is optimal and simple, it requires a lot of information to implement. We need the class-conditional distribution, which is hard to obtain and must be estimated. In real applications we usually assume a known distribution based on the problem, but in general it is not easy to obtain the distribution of the process. We also need the prior and marginal probabilities, which are simpler to obtain but add complexity to the problem. For this reason other classifiers have been developed; they are not optimal, but they can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math>\,r(x) = P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
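<br />
Approach 3 can be sketched in a few lines of Python for the simplest setting, a single discrete feature. This is only an illustration under stated assumptions: the function name <code>fit_plugin_classifier</code> and the toy data are hypothetical, and the class-conditional probabilities are estimated by simple counting rather than by any particular density estimator.<br />
<br />
```python
from collections import Counter

def fit_plugin_classifier(xs, ys):
    """Approach 3 for one discrete feature: estimate P(X=x|Y=y) and P(Y=1) by counting."""
    n = len(ys)
    pi1 = sum(ys) / n                          # estimate of P(Y = 1) as the mean of the labels
    counts = {0: Counter(), 1: Counter()}
    for x, y in zip(xs, ys):
        counts[y][x] += 1
    n_y = {0: n - sum(ys), 1: sum(ys)}         # number of training points in each class

    def r_hat(x):
        f1 = counts[1][x] / n_y[1]             # estimate of P(X=x | Y=1)
        f0 = counts[0][x] / n_y[0]             # estimate of P(X=x | Y=0)
        num = f1 * pi1
        den = num + f0 * (1 - pi1)
        return num / den if den > 0 else 0.5   # unseen x: no evidence either way

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Hypothetical toy data: one binary feature, eight labelled observations.
xs = [1, 1, 1, 0, 0, 0, 1, 0]
ys = [1, 1, 1, 1, 0, 0, 0, 0]
r_hat, h = fit_plugin_classifier(xs, ys)
print(r_hat(1), h(1), r_hat(0), h(0))
```
<br />
With continuous features the counting step would be replaced by a density estimate, for example the Gaussian assumption used by LDA and QDA.<br />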
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
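<br />
The arithmetic of this example can be checked with a short Python sketch. The likelihood values 0.05 and 0.2 are simply the numbers used in the calculation above (the tables themselves are not reproduced here), and reading 0.05 as <math>\,P(X=(0,1,0)|Y=1)</math> follows the numerator of that calculation; the variable and function names are illustrative only.<br />
<br />
```python
# Posterior for the new student x = (0, 1, 0), using the numbers quoted in the
# calculation above. Which likelihood belongs to which class is read off the
# numerator of that calculation (an assumption, since the tables are an image).
prior = {0: 0.5, 1: 0.5}          # estimated priors P(Y=0), P(Y=1)
likelihood = {0: 0.2, 1: 0.05}    # P(X=(0,1,0) | Y=y)

def posterior(y, prior, likelihood):
    """Bayes' formula: P(Y=y | X=x) = f_y(x) pi_y / sum_i f_i(x) pi_i."""
    evidence = sum(likelihood[i] * prior[i] for i in prior)
    return likelihood[y] * prior[y] / evidence

r_hat = posterior(1, prior, likelihood)
h = 1 if r_hat > 0.5 else 0       # classification rule with threshold 1/2
print(round(r_hat, 4), h)
```
<br />
With these inputs the denominator is 0.1 + 0.025 = 0.125, so the posterior probability of passing is 0.025/0.125 = 0.2 and the predicted class is 0 (fail).<br />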
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable the occurrence of E is before anything is known about other events that could affect it. Theoretically, this prior probability is a ''belief'' that represents the baseline probability of event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should new information regarding events relevant to event E become available. The more useful information of this kind is available, the more accurate the estimate of the probability of event E's occurrence becomes. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, assign a probability to rain tomorrow, because one cannot carry out repeated trials on an event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
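<br />
The long-run relative frequency <math>\,n_x/n_t</math> can be illustrated with a small simulation. This is a sketch only: the event probability 0.3, the trial counts and the function name <code>relative_frequency</code> are arbitrary choices made for the illustration.<br />
<br />
```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials independent trials of an event with probability p; return n_x / n_t."""
    rng = random.Random(seed)     # seeded so that repeated runs give the same result
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p)
    return n_x / n_trials

# The approximation n_x / n_t tends to settle near p as the number of trials grows.
for n_t in (10, 1000, 100000):
    print(n_t, relative_frequency(0.3, n_t))
```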
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the decision boundary the input lies. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the number of features. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
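<br />
The coefficients ''a'' and ''b'' of this boundary can be computed directly from the class means, the common covariance matrix and the priors. The following Python sketch does so for two features; the parameter values and the helper names (<code>inv2</code>, <code>matvec</code>, <code>dot</code>, <code>lda_boundary</code>) are hypothetical and chosen only for illustration.<br />
<br />
```python
import math

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the LDA decision boundary is a . x + b = 0."""
    sigma_inv = inv2(sigma)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    a = [-2 * t for t in matvec(sigma_inv, diff)]
    b = (-2 * math.log(pi1 / pi0)
         + dot(mu1, matvec(sigma_inv, mu1))
         - dot(mu0, matvec(sigma_inv, mu0)))
    return a, b

# Hypothetical parameters: two class means, a shared covariance, equal priors.
mu1, mu0 = [2.0, 0.0], [0.0, 0.0]
sigma = [[2.0, 1.0], [1.0, 2.0]]
a, b = lda_boundary(mu1, mu0, sigma, 0.5, 0.5)
midpoint = [1.0, 0.0]                    # halfway between the two means
print(a, b, dot(a, midpoint) + b)        # last value is ~0: the midpoint is on the boundary
```
<br />
With equal priors the log term vanishes, so the point halfway between the two class means satisfies the boundary equation.<br />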
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA in the two-class case can easily be generalized to the case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. One then obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. If the two classes <math>\,m </math> and <math>\,n</math> have equal prior probabilities (for instance, when the priors are estimated from training data containing the same number of points in each class), then the log term vanishes and the resulting decision boundary passes through the point exactly halfway between the class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found at [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions under LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that calculating the class posteriors <math>\Pr(Y=k|X=x)</math> gives optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
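<br />
The two discriminant functions and the <math>\,\arg\max</math> rule can be sketched in Python for the simplest case of a single feature (<math>\,d=1</math>), where <math>\,\Sigma_k</math> reduces to a variance. The parameter values and the function names (<code>delta_qda</code>, <code>delta_lda</code>, <code>classify</code>) are hypothetical and serve only as an illustration.<br />
<br />
```python
import math

def delta_qda(x, mu, var, pi):
    """Quadratic discriminant for one feature (d = 1), where Sigma_k is a variance."""
    return -0.5 * math.log(var) - 0.5 * (x - mu) ** 2 / var + math.log(pi)

def delta_lda(x, mu, var, pi):
    """Linear discriminant for one feature with a variance shared by all classes."""
    return x * mu / var - 0.5 * mu ** 2 / var + math.log(pi)

def classify(x, params, delta):
    """Return the class label k that maximises delta_k(x)."""
    return max(params, key=lambda k: delta(x, *params[k]))

# Hypothetical parameters (mu_k, variance, pi_k) for classes 1 and 2.
params = {1: (0.0, 1.0, 0.5), 2: (3.0, 1.0, 0.5)}
print(classify(1.0, params, delta_qda), classify(2.0, params, delta_lda))
```
<br />
Because the two classes in this sketch share the same variance, the quadratic and linear rules differ only by terms that are common to all classes and therefore give the same labels.<br />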
<br />
===In practice===<br />
The true parameters are not known in practice, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When a common covariance matrix is assumed (LDA), it is estimated by the average of the class sample covariances, weighted by class size:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma_r}}{\sum_{l=1}^{k}n_l} </math><br />
<br />
This is the maximum likelihood estimate of the common covariance matrix.<br />
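<br />
These estimates are straightforward to compute. The sketch below uses plain Python lists; the function name <code>estimate_parameters</code> and the toy data are hypothetical, and the covariances use the divisor <math>\,n_k</math> as in the formulas above.<br />
<br />
```python
def estimate_parameters(xs, ys):
    """Sample estimates of pi_k, mu_k, Sigma_k (divisor n_k) and the pooled covariance."""
    n, d = len(xs), len(xs[0])
    classes = sorted(set(ys))
    pi, mu, sigma = {}, {}, {}
    for k in classes:
        pts = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(pts)
        pi[k] = n_k / n
        mu[k] = [sum(p[j] for p in pts) / n_k for j in range(d)]
        sigma[k] = [[sum((p[i] - mu[k][i]) * (p[j] - mu[k][j]) for p in pts) / n_k
                     for j in range(d)] for i in range(d)]
    # Pooled estimate: class covariances weighted by class size (n_k / n = pi_k).
    pooled = [[sum(pi[k] * sigma[k][i][j] for k in classes) for j in range(d)]
              for i in range(d)]
    return pi, mu, sigma, pooled

# Hypothetical toy data: four two-dimensional points per class.
xs = [[0, 0], [2, 0], [0, 2], [2, 2], [4, 4], [6, 4], [4, 6], [6, 6]]
ys = [1, 1, 1, 1, 2, 2, 2, 2]
pi, mu, sigma, pooled = estimate_parameters(xs, ys)
print(pi, mu, pooled)
```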
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data of each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of <math>\ I </math> and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, for each class we can compute half the squared distance between the point and the class center and adjust it by the log of the prior, <math>\,\log(\pi_k)</math>. The class with the smallest adjusted distance maximises <math>\,\delta_k</math>, and according to the theorem we assign the point to that class <math>\,k</math>. In particular, when the priors are equal, the point is assigned to the class with the nearest center. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric positive semi-definite, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is a covariance matrix, so this holds.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
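<br />
The identity behind this transformation can be checked numerically. The sketch below uses a hand-picked covariance matrix whose decomposition is known in closed form; the matrix, the test points and the function names (<code>whiten</code>, <code>mahalanobis_sq</code>) are hypothetical choices for illustration.<br />
<br />
```python
import math

# Hand-picked example: Sigma = [[2, 1], [1, 2]] has eigenvalues 3 and 1 with
# orthonormal eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2), so Sigma = U S U^T.
s = 1 / math.sqrt(2)
U = [[s, s], [s, -s]]          # columns are the eigenvectors
S = [3.0, 1.0]                 # diagonal entries of S

def whiten(x):
    """Return x* = S^(-1/2) U^T x."""
    ut_x = [U[0][j] * x[0] + U[1][j] * x[1] for j in range(2)]   # U^T x
    return [ut_x[j] / math.sqrt(S[j]) for j in range(2)]

def mahalanobis_sq(x, mu):
    """(x - mu)^T Sigma^(-1) (x - mu), using Sigma^(-1) = (1/3) [[2, -1], [-1, 2]]."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (2 * d0 * d0 - 2 * d0 * d1 + 2 * d1 * d1) / 3

x, mu = [3.0, 1.0], [1.0, 0.0]
x_star, mu_star = whiten(x), whiten(mu)
euclid_sq = sum((p - q) ** 2 for p, q in zip(x_star, mu_star))
print(mahalanobis_sq(x, mu), euclid_sq)   # the two values agree up to rounding
```
<br />
After the transformation, the quadratic form in <math>\,\Sigma^{-1}</math> is an ordinary squared Euclidean distance, which is why the transformed problem can be treated as in Case 1.<br />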
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape, i.e. the same covariance matrix, for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we wish to transform them to the same shape. Given a new data point, we must decide which class it belongs to, but which transformation should we apply to it? If we use the transformation of class A, for example, then we have already assumed that the data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because it does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. Each of them has the form <math>\,a^{T}x+b</math>, which requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
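<br />
The two parameter counts can be tabulated with a few lines of Python; the function names are illustrative only, and the values of <math>\,d</math> are arbitrary examples.<br />
<br />
```python
def lda_num_params(K, d):
    """(K - 1) linear boundaries a^T x + b, each with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_num_params(K, d):
    """(K - 1) quadratic boundaries x^T a x + b^T x + c, with a symmetric."""
    return (K - 1) * (d * (d + 3) // 2 + 1)   # d * (d + 3) is always even

for d in (2, 10, 64):
    print(d, lda_num_params(2, d), qda_num_params(2, d))
```
<br />
For two classes and <math>\,d = 64</math>, this gives 65 parameters for LDA and 2145 for QDA.<br />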
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are non-linear functions, such as squares, of the original features. We then do LDA on our new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space can then be expressed in terms of the original features, giving us a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>. Since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra <math>\,\frac{d(d+1)}{2}</math> estimates make QDA less robust when there are fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
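where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. For this illustration, we take <math>\,v</math> to be a diagonal matrix with diagonal entries <math>\,v_1,\dots,v_d</math>, so that <math>\,x^Tvx = \sum_{i=1}^{d} v_i x_i^2</math>.<br />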
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix containing the original entries squared, with cubic features by appending the original entries cubed, or even with a different function altogether, such as <math>\,\sin(x)</math>.<br />
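<br />
The feature expansion can be sketched in Python as follows. The coefficient values and the function names (<code>expand_quadratic</code>, <code>linear</code>) are hypothetical; the point is only that a function that is linear in the expanded vector is quadratic in the original one.<br />
<br />
```python
def expand_quadratic(x):
    """Map x = [x_1, ..., x_d] to x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]

def linear(w_star, x_star):
    """w*^T x*, a linear function of the expanded features."""
    return sum(wi * xi for wi, xi in zip(w_star, x_star))

# Hypothetical coefficients: w = [1, -2] (linear part), v = [3, 0.5] (squared part).
w, v = [1.0, -2.0], [3.0, 0.5]
w_star = w + v
x = [2.0, 4.0]
y_star = linear(w_star, expand_quadratic(x))
# The same value computed directly as a quadratic in the original x.
y_direct = sum(vi * xi ** 2 + wi * xi for wi, vi, xi in zip(w, v, x))
print(y_star, y_direct)
```
<br />
A linear method such as LDA fitted on the expanded features therefore produces a boundary that is quadratic in the original features.<br />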
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we enlarge <code>X_star</code> to six columns and set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
{{Cleanup|date=September 2010|reason=Perhaps the first line should be plot (sample(1:200,1), sample(1:200,2), 'b.');?}} <br />
<br />
 >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
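:As a minimal sketch of this calculation, the snippet below computes the number of correct classifications and the empirical error rate for a hypothetical prediction vector <code>pred</code> that disagrees with <code>group</code> on 31 points. The vector is made up for illustration; in the example above its role is played by the <code>class</code> output of <code>classify</code>.<br />

```matlab
% True labels: 200 points in class 1 followed by 200 points in class 2.
group = [ones(200,1); 2*ones(200,1)];

% Hypothetical predictions: identical to the truth except for 31 points.
pred = group;
pred(1:31) = 2;

n_correct = sum(pred == group)    % number of correctly classified points (369)
err_rate  = mean(pred ~= group)   % empirical error rate, 31/400 = 0.0775
```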
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
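As a rough sketch of what such a modification might look like, the loop below calls <code>classify</code> with each of the discriminant types that the Statistics Toolbox documentation lists for its <code>type</code> argument. The data here are synthetic and purely illustrative (two Gaussian clouds, not the 2_3 data), and the variable names are assumptions made for this example.<br />

```matlab
% Synthetic two-class data: 100 points around (0,0) and 100 around (3,3).
rng(3);                                   % fix the seed for reproducibility
training = [randn(100,2); randn(100,2) + 3];
group    = [ones(100,1); 2*ones(100,1)];

% Discriminant types accepted by classify (requires the Statistics Toolbox).
types = {'linear', 'diaglinear', 'quadratic', 'diagquadratic', 'mahalanobis'};

for t = 1:numel(types)
    % Classify the training points themselves, as in the example above.
    pred = classify(training, training, group, types{t});
    fprintf('%-14s training accuracy: %.3f\n', types{t}, mean(pred == group));
end
```

Only the <code>type</code> string changes between calls; the rest of the workflow is the same as for LDA and QDA.<br />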
<br />
'''Recall: An analysis of the function <code>princomp</code> in Matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] that performs PCA conveniently. The details of this function can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of the important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in<br />
% SCORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off the column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U, up to the sign of each column.<br />
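Since <code>princomp</code> is not available in every Matlab installation, the following is a minimal sketch of the same comparison that uses only base functions. It runs on a small synthetic matrix (an assumption made for illustration, not the 2_3 data) and checks that the SVD of the centered data agrees with the eigendecomposition of the sample covariance matrix, up to the sign of each direction.<br />

```matlab
rng(0);                                   % fix the seed for reproducibility
X  = randn(5, 100);                       % 5 variables (rows), 100 observations (columns)
n  = size(X, 2);

% Center each variable, as princomp does for its columns.
Xc = X - repmat(mean(X, 2), 1, n);

% PCA via the SVD of the centered data (observations in rows).
[~, d, v] = svd(Xc', 0);                  % "economy size" decomposition
y      = Xc' * v;                         % scores: data in the principal component space
latent = diag(d).^2 / (n - 1);            % variances along the principal directions

% PCA via the eigendecomposition of the sample covariance matrix.
[E, L]     = eig(cov(X'));
[lam, idx] = sort(diag(L), 'descend');
E          = E(:, idx);

% Both differences should be close to zero (directions agree up to sign).
max(abs(latent - lam))
max(max(abs(abs(v) - abs(E))))
```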
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well-defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. Instead, we make the data points within each class as close to each other as possible while keeping them as far as possible from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
>> library(MASS)  # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
: Calculate the singular value decomposition of the column-centered X (PCA requires the data to be centered first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA).<br />
>> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without the computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, and the two clouds may differ in size. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, i.e. <math> max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math>.<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is scalar<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>max (\underline{w}^T S_{B} \underline{w})</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math><br />
<br />
Thus, the covariance of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we get<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
Our goal: <math>min(\underline{w}^T S_{W} \underline{w})</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to the constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>.<br />
<br />
The constraint is needed because the ratio does not change when <math>\underline{w}</math> is rescaled: without it, <math>\ \underline w^T S_B \underline w</math> has no upper bound and <math>\ \underline w^T S_W \underline w</math> can be made arbitrarily close to zero simply by scaling <math>\underline{w}</math>.<br />
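The scale invariance behind this constraint can be checked numerically. The sketch below is only an illustration: it uses the <math>\,S_{B}</math> and <math>\,S_{W}</math> that correspond to the two-Gaussian example used elsewhere on this page (means (1,1) and (5,3), common covariance [1 1.5; 1.5 3]), and the names <code>J</code>, <code>w</code> and <code>w1</code> are chosen for this example.<br />

```matlab
% Between-class and within-class matrices for mu1 = (1,1), mu2 = (5,3),
% Sigma1 = Sigma2 = [1 1.5; 1.5 3].
sb = [16 8; 8 4];            % (mu1 - mu2) * (mu1 - mu2)'
sw = [2 3; 3 6];             % Sigma1 + Sigma2

% The FDA objective as a function of the direction w.
J = @(w) (w' * sb * w) / (w' * sw * w);

w = [-9; 4];                 % an arbitrary direction chosen for illustration

% Rescaling w (even flipping its sign) leaves the objective unchanged.
[J(w), J(10 * w), J(-0.5 * w)]

% So we may always rescale w to satisfy the constraint w' * sw * w = 1.
w1 = w / sqrt(w' * sw * w);
w1' * sw * w1                % equals 1 up to rounding
```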
<br />
We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math> where <math>\ \lambda </math> is the Lagrange multiplier<br />
<br />
<br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices, which are positive semi-definite; provided at least one of them is positive definite, <math>\,S_{W}</math> is positive definite and therefore has an inverse.<br />
<br />
Here <math>\underline{w}</math> is an eigenvector of <math>S_{W}^{-1}\ S_{B}</math> with eigenvalue <math>\ \lambda </math>. Multiplying <math>S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w}</math> on the left by <math>\underline{w}^T</math> and using the constraint shows that the objective equals <math>\ \lambda </math> at such a point, so we choose the eigenvector corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say that <math>\underline{w}</math> is proportional to <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
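This proportionality can be checked numerically. The following is a minimal sketch, assuming the means and covariance of the two-Gaussian example used on this page; the names <code>w_eig</code> and <code>w_cf</code> are introduced only for this illustration. It compares the leading eigenvector of <math>S_{W}^{-1} S_{B}</math> with the closed-form direction.<br />

```matlab
mu1   = [1; 1];
mu2   = [5; 3];
Sigma = [1 1.5; 1.5 3];

sw = Sigma + Sigma;                     % within-class covariance S_W
sb = (mu1 - mu2) * (mu1 - mu2)';        % between-class covariance S_B

% Direction from the eigenvalue problem S_W^{-1} S_B w = lambda w.
[V, D] = eig(sw \ sb);
[~, i] = max(real(diag(D)));            % index of the largest eigenvalue
w_eig  = real(V(:, i));
w_eig  = w_eig / norm(w_eig);

% Closed-form direction S_W^{-1} (mu1 - mu2).
w_cf = sw \ (mu1 - mu2);
w_cf = w_cf / norm(w_cf);

% The two unit vectors agree up to sign, so this is 1 up to rounding.
abs(w_eig' * w_cf)
```

The closed form avoids the eigendecomposition altogether, which is convenient when the dimension is large.<br />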
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
This time, we will use Matlab to compare PCA and FDA.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes separately.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
From the graph, it can be observed that the two classes overlap heavily when projected onto the PCA direction, whereas there is hardly any overlap when they are projected onto the FDA direction. FDA separates the two classes better than PCA in this example.<br />
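The code above builds <code>sw</code> and <code>sb</code> from the true parameters of the two distributions. In practice these are unknown, so the following sketch estimates them from the samples instead. It is an illustration under stated assumptions: the data are generated with <code>randn</code> and a Cholesky factor rather than <code>mvnrnd</code>, so that only base Matlab is needed, and the names <code>z1</code>, <code>z2</code> and <code>J</code> are made up for this example.<br />

```matlab
rng(4);                                   % fix the seed for reproducibility
Sigma = [1 1.5; 1.5 3];
R = chol(Sigma);                          % Sigma = R' * R

% 300 samples per class with means (1,1) and (5,3) and covariance Sigma.
X1 = randn(300, 2) * R + repmat([1 1], 300, 1);
X2 = randn(300, 2) * R + repmat([5 3], 300, 1);

% Sample estimates of the class means and the within-class covariance.
m1 = mean(X1)';
m2 = mean(X2)';
sw = cov(X1) + cov(X2);

% FDA direction from the closed form S_W^{-1} (mu1 - mu2).
w = sw \ (m1 - m2);
w = w / norm(w);

% Project both classes onto w and compute the Fisher ratio of the projections:
% squared distance between projected means over the sum of projected variances.
z1 = X1 * w;
z2 = X2 * w;
J  = (mean(z1) - mean(z2))^2 / (var(z1) + var(z2))
```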
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
In the two-class case <math>\,S_{B}</math> has rank one, so only the first eigenvector carries discriminant information. We use it to project the data onto a line.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
The data are mapped onto a line, and the two classes are separated perfectly here.<br />
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for extracting structure from high-dimensional data sets. It has applications in data visualization, data mining, and reducing the dimensionality of a data set, among others. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
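The factorization can be sketched in a few lines of Matlab. The example below is only an illustration on a small random matrix standing in for the image data (20 "pixels" by 103 "images", an assumption made to keep the sketch self-contained); <code>V</code> and <code>W</code> play the roles described above, and <code>rel_err</code> is a name introduced here.<br />

```matlab
rng(2);                                     % fix the seed for reproducibility
X = randn(20, 103);                         % stand-in data: one column per "image"
n = size(X, 2);

% Subtract the mean image from every column.
xbar = mean(X, 2);
Xc   = X - repmat(xbar, 1, n);

% The left singular vectors of the centered data are the principal directions.
[U, ~, ~] = svd(Xc, 'econ');
V = U(:, 1:2);                              % first two principal directions (20 x 2)
W = V' * Xc;                                % two coefficients per image (2 x 103)

% Rank-2 approximation: each column is xbar + lambda_1 * v_1 + lambda_2 * v_2.
X_hat = repmat(xbar, 1, n) + V * W;

% Relative reconstruction error of the centered data.
rel_err = norm(X - X_hat, 'fro') / norm(Xc, 'fro')
```

On random data the error is large because no two directions dominate; on structured data such as handwritten digits, far more of the variance is captured by the first few components.<br />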
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the sample variance of the projection <math>\displaystyle u </math> of the data onto the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
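The bound can be illustrated numerically. The sketch below is an informal check rather than a proof: it draws random unit directions for a synthetic data set (the mixing matrix is an arbitrary choice made for this example) and confirms that the projected variance never exceeds the largest eigenvalue of the sample covariance matrix, which is the value of <math>\| s \|</math> in the spectral norm.<br />

```matlab
rng(1);                                    % fix the seed for reproducibility

% Synthetic data: 200 observations of 3 correlated variables.
X = randn(200, 3) * [2 0 0; 0.5 1 0; 0 0.3 0.2];
s = cov(X);                                % sample covariance matrix

lam_max = max(eig(s));                     % largest eigenvalue of s

% Largest projected variance found over 1000 random unit directions.
best = 0;
for t = 1:1000
    w = randn(3, 1);
    w = w / norm(w);                       % enforce the constraint w' * w = 1
    best = max(best, w' * s * w);
end

% The variance along any unit direction is bounded by lam_max.
best <= lam_max + 1e-12                    % displays 1 (true)
```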
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> maximizes <math>\displaystyle f(x,y)</math> subject to the constraint, then there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> touches the constraint curve <math>\displaystyle g(x,y) = 0</math> but does not cross it. At this point the two curves share a tangent, so the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which gives the larger value: the first gives <math>\displaystyle f = \sqrt{2}</math> and the second gives <math>\displaystyle f = -\sqrt{2}</math>. The maximum is therefore attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
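<br />
The example can also be checked numerically. The following is a minimal Python sketch that is not part of the lecture; the names <code>f</code>, <code>g</code> and <code>grad_L</code> are chosen here for illustration. It uses the Lagrangian <math>\displaystyle L = f - \lambda g</math> with <math>\displaystyle g(x,y) = x^2+y^2-1</math>, confirms that all partial derivatives vanish at both candidate points, and picks the candidate with the larger value of <math>\displaystyle f</math>.<br />
<br />
```python
import math

def f(x, y):
    """Objective function f(x, y) = x - y."""
    return x - y

def g(x, y):
    """Constraint written as g(x, y) = x^2 + y^2 - 1 = 0."""
    return x**2 + y**2 - 1

def grad_L(x, y, lam):
    """Partial derivatives of L(x, y, lam) = f(x, y) - lam * g(x, y)."""
    dL_dx = 1 - 2 * lam * x
    dL_dy = -1 - 2 * lam * y
    dL_dlam = -g(x, y)
    return (dL_dx, dL_dy, dL_dlam)

r = math.sqrt(2) / 2
# Candidate stationary points (x, y, lambda) obtained by solving the system by hand.
candidates = [(r, -r, r), (-r, r, -r)]

for x, y, lam in candidates:
    # Every partial derivative should vanish (up to rounding error).
    assert all(abs(d) < 1e-12 for d in grad_L(x, y, lam))
    print(f"(x, y) = ({x:.4f}, {y:.4f}), f = {f(x, y):.4f}")

# The constrained maximum is the candidate with the largest value of f.
best = max(candidates, key=lambda c: f(c[0], c[1]))
```
<br />
Comparing the values of <math>\displaystyle f</math> at the candidates is enough here because the constraint set (the unit circle) is closed and bounded, so a maximum must exist and must be one of the stationary points.<br />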
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
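If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is nonzero and <math>\displaystyle L(\textbf{w},\lambda)</math> no longer equals the objective. Setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = 0</math> recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of <math>\displaystyle L</math> satisfies it.<br />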
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', and '''S''' is symmetric, hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function, we choose the eigenvector corresponding to the largest eigenvalue. This eigenvector is the first principal component (PC), '''u<sub>1</sub>''', which has the maximum variance<br /> (i.e. it captures as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
<br /><br /><br />
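<br />
The result that the maximizing direction is the eigenvector of '''S''' with the largest eigenvalue can be illustrated numerically. The sketch below is not from the lecture: it uses power iteration (a technique swapped in here for illustration, not the method used in class) on a hypothetical 2×2 covariance matrix, and the helper names <code>mat_vec</code> and <code>top_eigenpair</code> are invented for this example. It assumes '''S''' is symmetric positive semi-definite and that the starting vector is not orthogonal to the top eigenvector.<br />
<br />
```python
import math

def mat_vec(S, w):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(s_ij * w_j for s_ij, w_j in zip(row, w)) for row in S]

def top_eigenpair(S, iterations=200):
    """Approximate the largest eigenvalue of S and its unit eigenvector by power iteration."""
    w = [1.0] + [0.0] * (len(S) - 1)           # arbitrary non-zero starting vector
    for _ in range(iterations):
        v = mat_vec(S, w)
        norm = math.sqrt(sum(v_i * v_i for v_i in v))
        w = [v_i / norm for v_i in v]          # renormalise so that w^T w = 1
    # Rayleigh quotient w^T S w; it equals the eigenvalue when w is a unit eigenvector.
    lam = sum(w_i * sw_i for w_i, sw_i in zip(w, mat_vec(S, w)))
    return lam, w

# Hypothetical 2x2 covariance matrix (illustrative values only).
S = [[2.0, 1.0],
     [1.0, 2.0]]
lam1, w1 = top_eigenpair(S)
trace = S[0][0] + S[1][1]
lam2 = trace - lam1                            # in 2-D the eigenvalues sum to Tr(S)
print(lam1, w1, lam2)
```
<br />
For this matrix the first principal direction lies along the diagonal <math>(1,1)/\sqrt{2}</math>, and the two eigenvalues add up to the trace of '''S''', in line with the variance decomposition above.<br />
<br />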
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                % load the data set; the images are stored in X<br />
 who                                       % list the variables now in the workspace<br />
 size(X)                                   % each column of X is one 20x28 image stored as a vector<br />
 imagesc(reshape(X(:,1),20,28)')           % display the first noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);                         % singular value decomposition of the data matrix<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)')     % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the denoised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
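<br />
The effect of keeping only the top components can be seen in a much smaller setting than the face data. The following Python sketch is an illustration under stated assumptions, not the class example: the five 2-D points are hypothetical, lying near the line <math>y = x</math> with a small perturbation in the orthogonal direction standing in for noise, and the eigen-decomposition uses the closed form that is available for a symmetric 2×2 matrix. The data are encoded with <math>d = 1</math> and then reconstructed.<br />
<br />
```python
import math

# Hypothetical zero-mean 2-D data: points near the line y = x, plus a small
# perturbation in the orthogonal direction (standing in for noise).
t = [-2.0, -1.0, 0.0, 1.0, 2.0]
noise = [0.1, -0.1, 0.0, 0.1, -0.1]
X = [(ti + ni, ti - ni) for ti, ni in zip(t, noise)]
n = len(X)

# Sample covariance S = (1/n) * sum of x x^T (the data already have mean zero).
sxx = sum(x * x for x, _ in X) / n
syy = sum(y * y for _, y in X) / n
sxy = sum(x * y for x, y in X) / n

# Closed-form eigen-decomposition of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]].
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
u1 = (math.cos(theta), math.sin(theta))        # first principal direction
half_diff = math.hypot((sxx - syy) / 2, sxy)
lam1 = (sxx + syy) / 2 + half_diff             # largest eigenvalue
lam2 = (sxx + syy) / 2 - half_diff             # smallest eigenvalue

# Encode with d = 1 (y_i = u1^T x_i), then reconstruct (x_hat_i = u1 * y_i).
Y = [u1[0] * x + u1[1] * y for x, y in X]
X_hat = [(u1[0] * yi, u1[1] * yi) for yi in Y]

# The squared reconstruction error equals n times the discarded eigenvalue.
sq_error = sum((x - xh) ** 2 + (y - yh) ** 2 for (x, y), (xh, yh) in zip(X, X_hat))
print(lam1, lam2, sq_error)
```
<br />
The variance that is thrown away is exactly the variance along the discarded direction. When that direction carries mostly noise, as assumed here, the reconstruction is a cleaner version of the data; when it carries signal, dropping it loses information, which is why the number of components matters.<br />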
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this. We can classify the data (i.e. use training data whose class labels are known to assign a label to each new data point) or cluster the data (i.e. group the data points by similarity without using any labels).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (for details, see [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>, where <math>\, U</math> is the <math>D \times d</math> matrix whose columns are the top d eigenvectors. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6432stat841f102010-10-02T21:57:54Z<p>Hclam: /* Objective Function */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples, who needed to recognize which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.or/wiki/Regression_analysis clustering], and [http://en.wikipedia.org/wiki/Regression_analysis dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input, <math>\,Y_{i} \in \mathcal{Y} </math>, takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
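<br />
The empirical error rate is simply the fraction of training inputs that the rule gets wrong. A minimal Python sketch follows; the one-dimensional training set and the threshold rule <code>h</code> are hypothetical and serve only to show the computation.<br />
<br />
```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    n = len(X)
    return sum(1 for x_i, y_i in zip(X, Y) if h(x_i) != y_i) / n

# Hypothetical one-dimensional training set (illustrative values only).
X_train = [0.2, 0.9, 1.4, 2.1, 2.8, 3.5]
Y_train = [0, 0, 1, 0, 1, 1]

def h(x):
    """Illustrative classification rule: predict class 1 when x exceeds 1.5."""
    return 1 if x > 1.5 else 0

error = empirical_error_rate(h, X_train, Y_train)
print(error)
```
<br />
Here the rule misclassifies the third and fourth training inputs, so the indicator sums to 2 over the 6 inputs.<br />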
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics). Its best-known simplified form, the naive Bayes classifier, additionally makes strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions; a more descriptive term for the probability model underlying naive Bayes would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the estimated posterior probabilities to deviate quite a bit from their true population values. A rather detailed proof of the above theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
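<br />
Approach 3 can be sketched in a few lines of Python for the simplest possible setting. Everything specific here is an assumption made for illustration: there is a single discrete feature, the class-conditional probabilities are estimated by relative frequencies, the training data are hypothetical, and the names <code>fit_plug_in</code>, <code>r_hat</code> and <code>h</code> are invented for this example. The sketch also assumes that both classes, and the queried feature value, appear in the training data.<br />
<br />
```python
from collections import Counter

def fit_plug_in(X, Y):
    """Estimate P(Y=1) and the class-conditional frequencies P(X=x | Y=y) from 0/1-labelled data."""
    n = len(Y)
    prior1 = sum(Y) / n                          # (1/n) * sum of Y_i
    counts = {0: Counter(), 1: Counter()}
    for x_i, y_i in zip(X, Y):
        counts[y_i][x_i] += 1
    totals = {y: sum(c.values()) for y, c in counts.items()}

    def likelihood(x, y):
        return counts[y][x] / totals[y]          # relative-frequency estimate

    return prior1, likelihood

def r_hat(x, prior1, likelihood):
    """Estimated posterior P(Y=1 | X=x) obtained from the Bayes formula."""
    num = likelihood(x, 1) * prior1
    den = num + likelihood(x, 0) * (1 - prior1)
    return num / den

def h(x, prior1, likelihood):
    """Plug-in rule: predict 1 when the estimated posterior exceeds 1/2."""
    return 1 if r_hat(x, prior1, likelihood) > 0.5 else 0

# Hypothetical training data with a single discrete feature.
X_train = ["a", "a", "b", "b", "b", "a", "b", "a"]
Y_train = [1, 1, 0, 0, 1, 0, 0, 1]
prior1, likelihood = fit_plug_in(X_train, Y_train)
print(r_hat("a", prior1, likelihood), h("b", prior1, likelihood))
```
<br />
With continuous features the relative frequencies would be replaced by a density estimate (for example a fitted Gaussian), but the structure of the rule stays the same.<br />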
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
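<br />
The rule in the theorem can be sketched directly from the formula <math>\,P(Y=y|X=x) = f_y(x)\pi_y / \sum_{i} f_i(x)\pi_i</math>. In the Python sketch below, the values of <math>\,f_i(x)</math> and <math>\,\pi_i</math> for three classes are hypothetical, and the function names <code>posteriors</code> and <code>bayes_rule</code> are chosen for this illustration.<br />
<br />
```python
def posteriors(likelihoods, priors):
    """Posterior P(Y=i | X=x) for each class i, given f_i(x) and pi_i as dicts keyed by class."""
    evidence = sum(likelihoods[i] * priors[i] for i in priors)
    return {i: likelihoods[i] * priors[i] / evidence for i in priors}

def bayes_rule(likelihoods, priors):
    """Return the class with the largest posterior probability (the arg max)."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)

# Hypothetical values of f_i(x) and pi_i for k = 3 classes at one input x.
f_x = {1: 0.10, 2: 0.40, 3: 0.25}
pi = {1: 0.5, 2: 0.2, 3: 0.3}

post = posteriors(f_x, pi)
label = bayes_rule(f_x, pi)
print(post, label)
```
<br />
In this example class 1 has the largest prior but class 2 is chosen, because the rule compares the products <math>\,f_i(x)\pi_i</math> rather than the priors or the likelihoods alone. The evidence is the same for every class, so it does not affect which class attains the maximum.<br />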
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his or her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.05 \times 0.5+0.2 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
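<br />
The arithmetic in this example can be reproduced with a few lines of Python. The likelihoods 0.05 and 0.2 and the equal priors of 0.5 are the values used in the text; the variable names are chosen here for illustration.<br />
<br />
```python
# Values used in the text: likelihoods of x = (0, 1, 0) and equal priors.
lik_pass, lik_fail = 0.05, 0.2       # P(X=x | Y=1), P(X=x | Y=0)
prior_pass, prior_fail = 0.5, 0.5    # P(Y=1), P(Y=0)

numerator = lik_pass * prior_pass
evidence = numerator + lik_fail * prior_fail
r_hat = numerator / evidence         # estimated P(Y=1 | X=x)

# Classification rule: predict pass (1) only if the posterior exceeds 1/2.
prediction = 1 if r_hat > 0.5 else 0
print(round(r_hat, 4), prediction)
```
<br />
Because the priors are equal, the decision depends only on which class gives the feature values the larger likelihood; here that is the ''fail'' class.<br />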
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable the occurrence of event E is prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, assign a probability to rain tomorrow, because one cannot carry out repeated trials on an event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
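<br />
As a rough illustration of the long-run frequency idea, the following minimal Matlab sketch simulates independent trials of an event and tracks the ratio <math>\,n_x/n_t</math> as the number of trials grows. The event probability <code>p</code> and the number of trials are hypothetical values chosen only for illustration.<br />
<br />
 p = 0.3;                                   % assumed intrinsic probability of the event (hypothetical)<br />
 n_t = 100000;                              % total number of simulated trials (hypothetical)<br />
 occurred = double(rand(n_t, 1) < p);       % 1 if the event occurs in a trial, 0 otherwise<br />
 rel_freq = cumsum(occurred) ./ (1:n_t)';   % running ratio n_x / n_t after each trial<br />
 disp(rel_freq([10 100 1000 n_t])');        % estimates after 10, 100, 1000 and n_t trials<br />
<br />
The displayed estimates should settle near <code>p</code> as the number of trials increases, although the exact values change from run to run.<br />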
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating boundary (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] in the linear case) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior probabilities and the class-conditional densities are not known. Under LDA, one gets around this problem by assuming that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the dimension of <math>\,x</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
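<br />
The boundary above can also be evaluated numerically. The following minimal Matlab sketch computes ''a'' and ''b'' for two hypothetical classes and checks on which side of the hyperplane a test point falls. The values of <code>mu0</code>, <code>mu1</code>, <code>Sigma</code>, <code>pi0</code>, <code>pi1</code> and <code>x</code> are made up for illustration and do not come from the course data.<br />
<br />
 mu0 = [0; 0];  mu1 = [2; 1];            % hypothetical class means (column vectors)<br />
 Sigma = [1 0.3; 0.3 2];                 % hypothetical shared covariance matrix<br />
 pi0 = 0.5;  pi1 = 0.5;                  % hypothetical prior probabilities<br />
 a = -2 * (Sigma \ (mu1 - mu0));         % a = -2*inv(Sigma)*(mu1 - mu0)<br />
 b = -2*log(pi1/pi0) + mu1'*(Sigma\mu1) - mu0'*(Sigma\mu0);<br />
 x = [1.5; 1];                           % hypothetical new data point<br />
 g = a'*x + b;                           % g equals zero exactly on the decision boundary<br />
 if g < 0<br />
     label = 1;                          % g < 0 means f_1(x)*pi_1 > f_0(x)*pi_0<br />
 else<br />
     label = 0;<br />
 end<br />
<br />
Because the left-hand side of the boundary equation equals <math>\,-2\left[\log(f_1(x)\pi_1)-\log(f_0(x)\pi_0)\right]</math>, a negative value corresponds to class 1 having the larger posterior probability.<br />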
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>; then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. For any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> have the same prior probability (for instance, when the priors are estimated from equal numbers of training points), then the log term vanishes and the resulting decision boundary passes through the point exactly halfway between the means of <math>\,m </math> and <math>\,n</math>.<br />
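<br />
As a check of the halfway claim, suppose the two priors are equal, <math>\,\pi_m=\pi_n</math> (which is what equal class sizes give when the priors are estimated by class proportions). The log term then vanishes and the boundary reduces to<br />
<br />
:<math>\,\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0.</math><br />
<br />
Substituting the midpoint of the class means, <math>\,x=\frac{1}{2}(\mu_m+\mu_n)</math>, gives<br />
<br />
:<math>\,2x^\top\Sigma^{-1}(\mu_m-\mu_n)=(\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n)=\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n,</math><br />
<br />
where the cross terms cancel because <math>\,\Sigma^{-1}</math> is symmetric. The midpoint therefore satisfies the equation, so with equal priors the boundary passes through the point halfway between the two class means.<br />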
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we obtain the optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian for each class <math>\,k</math>, the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
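<br />
The rule in the theorem can be sketched in a few lines of Matlab. The sketch below evaluates the quadratic discriminant for each class and picks the class with the largest value. All numbers (<code>x</code>, <code>mu</code>, <code>Sigma</code>, <code>prior</code>) are hypothetical and serve only to illustrate the computation.<br />
<br />
 x = [1; 2];                                 % hypothetical point to classify<br />
 mu = [0 3; 0 3];                            % hypothetical class means, one column per class<br />
 Sigma = cat(3, eye(2), [2 0.5; 0.5 1]);     % hypothetical class covariance matrices, one d-by-d slice per class<br />
 prior = [0.6 0.4];                          % hypothetical class priors<br />
 n_classes = numel(prior);<br />
 delta = zeros(1, n_classes);<br />
 for k = 1:n_classes<br />
     diff_k = x - mu(:, k);                  % deviation of x from the mean of class k<br />
     delta(k) = -0.5*log(det(Sigma(:,:,k))) - 0.5*(diff_k' * (Sigma(:,:,k) \ diff_k)) + log(prior(k));<br />
 end<br />
 [best, h] = max(delta);                     % h is the index of the class with the largest discriminant<br />
<br />
For LDA, the same loop can be used with a single shared covariance matrix in place of <code>Sigma(:,:,k)</code>; the log-determinant term is then identical for every class and does not affect the result of <code>max</code>.<br />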
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When a common covariance matrix is assumed, it is estimated by the weighted average of the class sample covariance matrices over the <math>\,K</math> classes:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{K}n_r\hat{\Sigma_r}}{\sum_{l=1}^{K}n_l} </math><br />
<br />
This is the maximum likelihood estimate of the common covariance matrix.<br />
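<br />
The estimates above might be computed as in the following minimal Matlab sketch. The data matrix <code>X</code> and the labels <code>y</code> are hypothetical, with one observation per row, and the variable names are chosen here for illustration only.<br />
<br />
 X = [1 2; 2 1; 3 3; 6 5; 7 7; 8 6; 9 8];     % hypothetical data, one row per observation<br />
 y = [1; 1; 1; 2; 2; 2; 2];                   % hypothetical class labels<br />
 [n, d] = size(X);<br />
 n_classes = max(y);<br />
 prior = zeros(1, n_classes);  mu = zeros(n_classes, d);  S = zeros(d, d, n_classes);<br />
 pooled = zeros(d, d);<br />
 for k = 1:n_classes<br />
     Xk = X(y == k, :);                       % observations that belong to class k<br />
     nk = size(Xk, 1);<br />
     prior(k) = nk / n;                       % estimate of pi_k<br />
     mu(k, :) = mean(Xk, 1);                  % estimate of mu_k<br />
     Xc = Xk - repmat(mu(k, :), nk, 1);       % class observations centered at their mean<br />
     S(:, :, k) = (Xc' * Xc) / nk;            % estimate of Sigma_k (divides by n_k, as in the formula)<br />
     pooled = pooled + nk * S(:, :, k);<br />
 end<br />
 pooled = pooled / n;                         % weighted average: estimate of the common covariance<br />
<br />
QDA would use the per-class matrices in <code>S</code>, while LDA would use the single matrix <code>pooled</code>.<br />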
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data in each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, we can compute the squared distance between a point and each class center and adjust it by the log of the prior, <math>\,\log(\pi_k)</math>. The class with the smallest adjusted distance, <math>\,\frac{1}{2}(x-\mu_k)^\top(x-\mu_k) - \log(\pi_k)</math>, maximizes <math>\,\delta_k</math>, and according to the theorem we classify the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
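<br />
A small worked example, with made-up numbers, shows how the prior adjustment can change the decision. Take two classes in two dimensions with <math>\,\mu_1=(0,0)^\top</math>, <math>\,\mu_2=(2,0)^\top</math>, <math>\,\pi_1=0.8</math>, <math>\,\pi_2=0.2</math>, and a new point <math>\,x=(1.2,0)^\top</math>. With <math>\,\Sigma_k=I</math> and natural logarithms throughout:<br />
<br />
:<math>\,\delta_1(x) = -\frac{1}{2}(1.2^2+0^2) + \log(0.8) \approx -0.72 - 0.223 = -0.943</math><br />
:<math>\,\delta_2(x) = -\frac{1}{2}(0.8^2+0^2) + \log(0.2) \approx -0.32 - 1.609 = -1.929</math><br />
<br />
Although <math>\,x</math> is closer to <math>\,\mu_2</math>, the larger prior of class 1 gives <math>\,\delta_1(x) > \delta_2(x)</math>, so the point is assigned to class 1.<br />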
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
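<br />
The transformation can be checked numerically. The following minimal Matlab sketch, which uses a hypothetical covariance matrix, class mean and data point, builds <math>\,S^{-\frac{1}{2}}U^\top</math> from the SVD and compares the squared Euclidean distance after the transformation with the original quadratic form.<br />
<br />
 Sigma = [4 1.2; 1.2 1];                          % hypothetical shared covariance matrix (symmetric, positive definite)<br />
 mu_k = [1; 2];  x = [3; 1];                      % hypothetical class mean and data point<br />
 [U, S, V] = svd(Sigma);                          % for this symmetric positive definite Sigma, U and V coincide up to rounding<br />
 W = diag(1 ./ sqrt(diag(S))) * U';               % W is S^(-1/2) times the transpose of U<br />
 x_star = W * x;  mu_star = W * mu_k;             % transformed data point and class mean<br />
 d_euclid = sum((x_star - mu_star).^2);           % squared Euclidean distance after the transformation<br />
 d_mahal = (x - mu_k)' * (Sigma \ (x - mu_k));    % quadratic form before the transformation<br />
<br />
The two values <code>d_euclid</code> and <code>d_mahal</code> should agree up to rounding error, which is the identity derived above.<br />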
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose that we have two classes with different shapes and that we want to transform them to the same shape. Given a new data point, we must decide which class it belongs to. The question is, which transformation can we use? For example, if we use the transformation of class A, then we have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distributions are Gaussian, which may not hold in practice. Another method, kernel QDA, does not rely on the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
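<br />
To make these counts concrete, consider two illustrative cases. For <math>\,K=2</math> classes in <math>\,d=2</math> dimensions:<br />
<br />
:<math>\,\mbox{LDA: } (2-1)(2+1) = 3, \qquad \mbox{QDA: } (2-1)\left(\frac{2(2+3)}{2}+1\right) = 6.</math><br />
<br />
For <math>\,K=2</math> classes in <math>\,d=64</math> dimensions (the dimension of the 2_3 data before PCA):<br />
<br />
:<math>\,\mbox{LDA: } (2-1)(64+1) = 65, \qquad \mbox{QDA: } (2-1)\left(\frac{64(64+3)}{2}+1\right) = 2145.</math><br />
<br />
The LDA count grows linearly in <math>\,d</math>, while the QDA count grows quadratically.<br />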
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of our original features. We then do LDA on our new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space, when expressed in terms of the original features, is a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the number of free entries in a symmetric <math>\,d \times d</math> matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
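where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is a <math>\,d \times d</math> matrix, that we cannot estimate with a linear method.<br />
<br />
Using our trick, and keeping only the diagonal entries <math>\,v_1,\dots,v_d</math> of <math>\,v</math> (so that cross terms <math>\,x_ix_j</math> are dropped), we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />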
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix that contains the element-wise squares of the original entries, with cubic features by appending the element-wise cubes, or even with a different function altogether, such as <math>\,\sin(x)</math>.<br />
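<br />
As a minimal sketch of this augmentation, the following Matlab lines append element-wise squares to a small hypothetical data matrix. The matrix <code>X</code> is made up, with one observation per row, which is the layout that <code>classify</code> expects.<br />
<br />
 X = [1 2; 3 4; 5 6];            % hypothetical data: 3 observations (rows), 2 features (columns)<br />
 X_star = [X, X.^2];             % append the element-wise squares as two new features<br />
 % X_star is 3-by-4: columns 1-2 hold x_1 and x_2, columns 3-4 hold x_1^2 and x_2^2<br />
<br />
Other element-wise functions, for example <code>X.^3</code> or <code>sin(X)</code>, can be appended in the same way.<br />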
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
{{Cleanup|date=September 2010|reason=Perhaps the first line should be plot (sample(1:200,1), sample(1:200,2), 'b.');?}} <br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
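<br />
:The error rate can also be computed directly rather than by hand. The following minimal sketch uses two short hypothetical label vectors in place of the real <code>class</code> and <code>group</code> so that it runs on its own.<br />
<br />
 group = [1; 1; 1; 2; 2; 2];             % hypothetical true labels<br />
 class = [1; 2; 1; 2; 2; 1];             % hypothetical predicted labels<br />
 n_correct = sum(class == group);        % number of correctly classified points<br />
 err_rate = mean(class ~= group);        % empirical error rate, equal to 1 - n_correct/numel(group)<br />
<br />
:Applied to the LDA result above, the 31 misclassified points out of 400 give 31/400 = 0.0775.<br />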
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA (371 versus 369); we can see a blue point and a red point that lie on the correct side of the curve but do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which can perform PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 %   returns the principal components in PC, the so-called Z-scores in SCORES,<br />
 %   the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
    latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U, up to the sign of each column.<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. Instead, we make the data points within each class as close to each other as possible while keeping them far from the points of the other class.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
 >> library(MASS)  # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
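: Calculate the singular value decomposition of the centred data matrix (PCA requires the column means to be subtracted first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA)<br />
 >> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />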
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without their computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, and the individual classes may have different sizes. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, i.e. <math> \max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> \min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math><br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is scalar<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>max (\underline{w}^T S_{B} \underline{w})</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math><br />
<br />
Thus, the covariance of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we get<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
Our goal: <math>min(\underline{w}^T S_{W} \underline{w})</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to the constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>.<br />
<br />
The constraint is needed because, without it, <math>\ \underline w^T S_B \underline w</math> has no upper bound: it can be made arbitrarily large simply by scaling <math>\underline{w}</math> up. Likewise, <math>\ \underline w^T S_W \underline w</math> can be made arbitrarily close to zero by scaling <math>\underline{w}</math> down. Fixing <math>\underline{w}^T S_{W} \underline{w} = 1</math> removes this freedom, so that only the direction of <math>\underline{w}</math> matters.<br />
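<br />
A short check that fixing the denominator loses no generality (here <math>c</math> is any nonzero scalar, introduced only for this illustration): rescaling <math>\underline{w}</math> leaves the ratio unchanged,<br />
<br />
:<math>\frac{(c\underline{w})^T S_{B} (c\underline{w})}{(c\underline{w})^T S_{W} (c\underline{w})} = \frac{c^2\ \underline{w}^T S_{B} \underline{w}}{c^2\ \underline{w}^T S_{W} \underline{w}} = \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
so any maximizer can be rescaled to satisfy <math>\underline{w}^T S_{W} \underline{w} = 1</math> without changing the value of the objective.<br />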
<br />
We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math> where <math>\ \lambda </math> is the Lagrange multiplier<br />
<br />
<br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive definite matrices (assuming each class covariance is nonsingular), so it is itself positive definite and has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{W}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue <math>\ \lambda </math>. The largest eigenvalue is chosen because, at a stationary point, <math>\underline{w}^T S_{B} \underline{w} = \lambda\ \underline{w}^T S_{W} \underline{w} = \lambda</math>.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say that <math>\underline{w}</math> is proportional to <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
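<br />
The result above can be checked numerically. The following Matlab lines are only a sketch on synthetic data: the sample sizes, means and covariance are borrowed from the R example earlier and the variable names are illustrative. It computes the closed-form direction and compares it with the leading eigenvector of <math>S_{W}^{-1} S_{B}</math>.<br />
<br />
 rng(0);                                      % fix the seed so the run is repeatable<br />
 Sigma = [1 1.5; 1.5 3];                      % common covariance (assumed for this sketch)<br />
 A  = chol(Sigma);                            % upper triangular factor, Sigma = A'*A<br />
 X1 = randn(200,2)*A + repmat([1 1],200,1);   % class 1, rows are observations<br />
 X2 = randn(200,2)*A + repmat([5 3],200,1);   % class 2<br />
 mu1 = mean(X1)';  mu2 = mean(X2)';           % class means as column vectors<br />
 SB = (mu1-mu2)*(mu1-mu2)';                   % between class covariance<br />
 SW = cov(X1) + cov(X2);                      % within class covariance<br />
 w  = SW \ (mu1-mu2);                         % closed form: w proportional to inv(SW)*(mu1-mu2)<br />
 w  = w / sqrt(w'*SW*w);                      % rescale so that w'*SW*w = 1<br />
 [V,D] = eig(SW\SB);                          % eigenvectors of inv(SW)*SB, for comparison<br />
 [lambda,k] = max(real(diag(D)));             % largest eigenvalue<br />
 v = real(V(:,k));<br />
 disp(abs(w'*v)/(norm(w)*norm(v)));           % close to 1: both give the same direction<br />
<br />
Solving the linear system with the backslash operator avoids forming <math>S_{W}^{-1}</math> explicitly, which is usually the more stable choice.<br />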
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for dimensionality reduction that is often used as a preprocessing step for classification. Its applications include data visualization, data mining, and reducing the dimensionality of a data set. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
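<br />
The factorization described above can be sketched in Matlab as follows. The threes data set is not loaded here; a random 644 by 103 matrix stands in for it, and the names <code>xbar</code> and <code>Xhat</code> are illustrative only.<br />
<br />
 X = rand(644,103);                        % placeholder for the 644 x 103 matrix of images<br />
 xbar = mean(X,2);                         % mean image (644 x 1)<br />
 Xc = X - repmat(xbar,1,size(X,2));        % subtract the mean image from every column<br />
 [U,S,Vr] = svd(Xc,'econ');                % economy size SVD of the centred data<br />
 V = U(:,1:2);                             % 644 x 2: the first two principal components<br />
 W = V'*Xc;                                % 2 x 103: one pair (lambda_1, lambda_2) per image<br />
 Xhat = repmat(xbar,1,size(X,2)) + V*W;    % column i is xbar + lambda_1*v_1 + lambda_2*v_2<br />
 disp(size(V)); disp(size(W));             % 644 2 and 2 103<br />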
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slenderness of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any nonzero scalar <math>c</math>, so without loss of generality we assume that <math>\textbf{w}</math> has unit length,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> of the data onto the direction given by the weight vector <math>\displaystyle w </math>. The first principal component is the direction that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
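<br />
As a quick numerical illustration of the two facts used above, the following sketch (synthetic data, illustrative names) checks that the sample variance of the projections equals <math>\textbf{w}^T s \textbf{w}</math> and that it does not exceed <math>\| s \|</math> for a unit vector.<br />
<br />
 rng(0);                                          % fix the seed so the run is repeatable<br />
 X = randn(500,3)*[2 0 0; 0.5 1 0; 0 0.3 0.2];    % hypothetical sample, rows are observations<br />
 S = cov(X);                                      % sample covariance matrix<br />
 w = randn(3,1);  w = w/norm(w);                  % an arbitrary unit direction<br />
 u = X*w;                                         % projection of every data point onto w<br />
 disp([var(u), w'*S*w]);                          % the two numbers agree<br />
 disp(w'*S*w <= norm(S));                         % displays 1 (true)<br />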
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to understand which one is the maximum, we just need to substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
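<br />
The answer can be confirmed by brute force. This sketch simply walks around the unit circle on a fine grid (the grid size is arbitrary) and then checks that the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel at the best point.<br />
<br />
 theta = linspace(0,2*pi,100001);          % parametrize the constraint x^2 + y^2 = 1<br />
 x = cos(theta);  y = sin(theta);<br />
 [fmax,i] = max(x - y);                    % largest value of f along the constraint<br />
 disp([x(i), y(i), fmax]);                 % approximately 0.7071 -0.7071 1.4142<br />
 disp(1*(2*y(i)) - (-1)*(2*x(i)));         % approximately 0: grad f = (1,-1) is parallel to grad g = (2x,2y)<br />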
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0. <br />
<br />
Setting the derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\lambda</math> to zero recovers exactly this constraint, <math> \textbf{w}^T \textbf{w} =1 </math>. Every stationary point of the Lagrangian is therefore a unit vector, and the value of <math>\displaystyle L</math> there equals <math> \textbf{w}^T S \textbf{w} </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
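<br />
A minimal Matlab sketch of this step, on synthetic data with illustrative names: the eigenvectors of the sample covariance are sorted by eigenvalue, and the variance of the data projected onto the first one is compared with the largest eigenvalue.<br />
<br />
 rng(0);                                          % fix the seed so the run is repeatable<br />
 X = randn(500,3)*[2 0 0; 0.5 1 0; 0 0.3 0.2];    % hypothetical sample, rows are observations<br />
 S = cov(X);                                      % sample covariance matrix<br />
 [W,L] = eig(S);                                  % columns of W are eigenvectors of S<br />
 [lambda,order] = sort(diag(L),'descend');        % lambda(1) >= lambda(2) >= lambda(3)<br />
 W = W(:,order);                                  % reorder the eigenvectors to match<br />
 w1 = W(:,1);                                     % direction of the first principal component<br />
 u1 = X*w1;                                       % data projected onto the first PC<br />
 disp([w1'*S*w1, lambda(1), var(u1)]);            % all three values agree<br />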
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
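<br />
This decomposition of the total variance is easy to verify numerically. The sketch below uses a synthetic sample (names are illustrative) and prints the three sums from the identity above.<br />
<br />
 rng(0);                                          % fix the seed so the run is repeatable<br />
 X = randn(500,3)*[2 0 0; 0.5 1 0; 0 0.3 0.2];    % hypothetical sample, rows are observations<br />
 S = cov(X);                                      % sample covariance matrix<br />
 lambda = eig(S);                                 % variances along the principal components<br />
 disp([sum(lambda), trace(S), sum(var(X))]);      % all three values agree<br />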
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
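<br />
One common heuristic, which is not taken from the lecture and is shown here only as a sketch, is to keep enough components to retain a chosen share of the sum of squared singular values. The data below are synthetic (a rank 10 signal plus noise) and the 0.95 threshold is an arbitrary illustrative choice.<br />
<br />
 rng(0);                                                  % fix the seed so the run is repeatable<br />
 X = randn(560,10)*randn(10,300) + 0.1*randn(560,300);    % synthetic rank 10 signal plus noise<br />
 sv = svd(X);                                             % singular values only<br />
 frac = cumsum(sv.^2)/sum(sv.^2);                         % cumulative share retained by the first k components<br />
 k = find(frac >= 0.95, 1);                               % smallest k reaching the (arbitrary) threshold<br />
 disp(k);<br />
<br />
Plotting <code>frac</code> against the number of components and looking for the point where the curve flattens serves the same purpose.<br />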
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
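<br />
Since the slide is only available as an image, the following Matlab lines sketch the usual steps under stated assumptions: the data matrix is D by n with one point per column, the mean is subtracted first, and the top d left singular vectors are used for encoding and reconstruction. All variable names are illustrative.<br />
<br />
 rng(0);                                   % fix the seed so the run is repeatable<br />
 X = randn(5,200);                         % hypothetical D x n training data, D = 5, n = 200<br />
 d = 2;                                    % target dimension<br />
 mu = mean(X,2);                           % mean of the training data (D x 1)<br />
 Xc = X - repmat(mu,1,size(X,2));          % centre the data<br />
 [U,S,V] = svd(Xc,'econ');                 % economy size SVD<br />
 U = U(:,1:d);                             % top d principal directions (D x d)<br />
 Y = U'*Xc;                                % encode the training set: d x n<br />
 Xhat = U*Y + repmat(mu,1,size(X,2));      % reconstruct the training set: D x n<br />
 x = randn(5,1);                           % a new test point<br />
 y = U'*(x - mu);                          % encode the test point<br />
 xhat = U*y + mu;                          % reconstruct the test point<br />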
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using the mapping <math>\, U^T X_{D \times n} </math>, where <math>\, U</math> is <math>D \times d</math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using the mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6431stat841f102010-10-02T21:55:09Z<p>Hclam: /* Two-class Problem */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input, <math>\,Y_{i} \in \mathcal{Y} </math>, takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
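<br />
As a minimal sketch of the definition above, the empirical error rate can be computed by counting the training pairs that a rule misclassifies. The function name <code>empirical_error_rate</code>, the threshold rule, and the toy diameters and labels are all hypothetical and serve only as an illustration.<br />

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of training pairs (x_i, y_i) for which h(x_i) != y_i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    # The indicator I(h(X_i) != Y_i) contributes 1 for every mistake.
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


# Hypothetical rule: label a fruit 1 if its diameter exceeds 7, else 0.
def threshold_rule(diameter):
    return 1 if diameter > 7.0 else 0


# Made-up training data: one of the five inputs is misclassified.
diameters = [6.0, 6.5, 7.5, 8.0, 6.8]
labels = [0, 0, 1, 1, 1]
error = empirical_error_rate(threshold_rule, diameters, labels)
```
<br />
Because this quantity is computed on the same data that were used to choose the rule, it can be smaller than the true error rate <math>\,L(h)</math>.<br />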
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
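<br />
The two-class rule above can be sketched in a few lines of code. This is only an illustration under the assumption that the two likelihood values <math>\,P(X=x|Y=1)</math>, <math>\,P(X=x|Y=0)</math> and the prior <math>\,P(Y=1)</math> are already known or estimated; the function names are hypothetical.<br />

```python
def posterior_class1(lik1, lik0, prior1):
    """r(x) = P(Y=1|X=x) from the two likelihoods and the prior P(Y=1)."""
    prior0 = 1.0 - prior1
    # Evidence P(X=x), obtained by summing over both classes.
    evidence = lik1 * prior1 + lik0 * prior0
    if evidence == 0.0:
        raise ValueError("x has zero probability under both classes")
    return lik1 * prior1 / evidence


def bayes_rule(lik1, lik0, prior1):
    """h*(x): 1 if r(x) > 1/2, otherwise 0 (ties go to class 0)."""
    return 1 if posterior_class1(lik1, lik0, prior1) > 0.5 else 0
```
<br />
Comparing <math>\,r(x)</math> with <math>\,\frac{1}{2}</math> is equivalent to comparing the two posterior probabilities directly, since they sum to one.<br />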
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that the components that make up the Bayes classifier's model, such as the likelihoods and prior probabilities, must either be estimated from training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate considerably from their true population values, which in turn can cause the estimated posterior probabilities to deviate considerably from their true values. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
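<br />
The density estimation approach (approach 3) can be sketched as follows. The notes do not fix a density model at this point, so this sketch assumes, purely for illustration, a single real-valued feature and a Gaussian density for each class; the function names and the toy data are hypothetical.<br />

```python
import math


def fit_gaussian(values):
    """Sample mean and maximum-likelihood variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_plugin_classifier(xs, ys):
    """Return (r_hat, h) built from per-class density estimates."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    prior1 = sum(ys) / len(ys)  # estimate of P(Y = 1)
    m0, v0 = fit_gaussian(x0)   # estimate of P(X = x | Y = 0)
    m1, v1 = fit_gaussian(x1)   # estimate of P(X = x | Y = 1)

    def r_hat(x):
        a = gaussian_pdf(x, m1, v1) * prior1
        b = gaussian_pdf(x, m0, v0) * (1.0 - prior1)
        return a / (a + b)

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h


# Made-up one-dimensional training data.
xs = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 1, 1]
r_hat, h = fit_plugin_classifier(xs, ys)
```
<br />
For inputs very far from the training data, both densities can underflow to zero; a more careful version would compare log-densities instead. A class whose training values are all identical would also give a zero variance estimate, which this sketch does not handle.<br />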
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
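<br />
A minimal sketch of the multi-class rule in the theorem: compute <math>\,f_i(x)\pi_i</math> for every class, normalize by their sum, and return the class with the largest posterior. The dictionaries of priors and likelihood values below are made-up numbers used only for illustration, and the function names are hypothetical.<br />

```python
def posteriors(likelihoods, priors):
    """P(Y=i|X=x) for every class i, given f_i(x) and pi_i as dicts keyed by class."""
    joint = {i: likelihoods[i] * priors[i] for i in priors}
    evidence = sum(joint.values())  # P(X = x)
    return {i: joint[i] / evidence for i in joint}


def bayes_classify(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Illustrative numbers only (three classes).
priors = {1: 0.5, 2: 0.3, 3: 0.2}
likelihoods = {1: 0.10, 2: 0.40, 3: 0.25}
post = posteriors(likelihoods, priors)
label = bayes_classify(likelihoods, priors)
```
<br />
Since the evidence is the same for every class, the arg max could equally be taken over the unnormalized products <math>\,f_i(x)\pi_i</math>.<br />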
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes, represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
Each student's feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.2*0.5+0.05*0.5}=\frac{0.025}{0.125}=0.2<\frac{1}{2}.</math><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
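<br />
The arithmetic of this example can be checked with a short script. It uses only the two likelihood values and the priors that appear in the calculation above (0.05 for ''pass'', 0.2 for ''fail'', and priors of 0.5 each); the variable names are hypothetical.<br />

```python
lik_pass = 0.05  # P(X=(0,1,0) | Y=1), as used in the calculation above
lik_fail = 0.20  # P(X=(0,1,0) | Y=0), as used in the calculation above
prior_pass = prior_fail = 0.5

# Bayes formula: numerator 0.025, denominator 0.1 + 0.025 = 0.125.
r_hat = lik_pass * prior_pass / (lik_fail * prior_fail + lik_pass * prior_pass)

# Classification rule: predict "pass" (1) only if r_hat exceeds 1/2.
prediction = 1 if r_hat > 0.5 else 0
```
<br />
The script gives <math>\,\hat r(x) = 0.2</math> and a prediction of 0 (''fail''), in agreement with the conclusion above.<br />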
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable it is that event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
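<br />
The long-run relative frequency <math>\frac{n_x}{n_t}</math> can be illustrated with a small simulation. This is only a sketch: the event is assumed to occur with a made-up intrinsic probability of 0.3 in each independent trial, and the function name and the fixed seed are hypothetical choices made for reproducibility.<br />

```python
import random


def relative_frequency(p_true, n_trials, seed=0):
    """Simulate n_trials independent trials and return n_x / n_t."""
    rng = random.Random(seed)
    # n_x counts the trials in which the event occurred.
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_x / n_trials


# Relative frequencies for an increasing number of trials.
estimates = {n: relative_frequency(0.3, n) for n in (10, 1000, 100000)}
```
<br />
With few trials the relative frequency can be far from 0.3, whereas with many trials it is typically close to it, which is the behaviour the frequentist definition relies on.<br />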
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where the vector ''a'' and the scalar ''b'' are constants. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
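<br />
The constants ''a'' and ''b'' identified above can be computed directly once the class means, the shared covariance matrix, and the priors are given. The following is a minimal sketch restricted, for simplicity, to two features so that the matrix inverse can be written out by hand; all function names and the example numbers are hypothetical.<br />

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) with a = -2 Sigma^{-1}(mu1 - mu0) and b as derived above."""
    s_inv = inv2(sigma)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    a = [-2.0 * c for c in matvec(s_inv, diff)]
    b = (-2.0 * math.log(pi1 / pi0)
         + dot(mu1, matvec(s_inv, mu1))
         - dot(mu0, matvec(s_inv, mu0)))
    return a, b


def lda_side(x, a, b):
    """Class 1 when the boundary expression is negative, class 0 otherwise."""
    return 1 if dot(a, x) + b < 0 else 0


# Made-up example: identity covariance and equal priors.
a, b = lda_boundary([2.0, 2.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
```
<br />
The sign convention in <code>lda_side</code> follows from the derivation: class 1 is more probable exactly when <math>\,f_1(x)\pi_1 > f_0(x)\pi_0</math>, and multiplying the corresponding inequality by -2 reverses its direction.<br />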
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. Then all we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with such a derivation, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. For any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> have equal prior probabilities (for example, when the priors are estimated from classes containing the same number of data points), then, in this special case, the log term vanishes and the resulting decision boundary passes through the midpoint between the means of <math>\,m </math> and <math>\,n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariances of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
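<br />
The theorem can be illustrated with a minimal sketch. It is written in Python rather than Matlab so that it runs without any toolbox, it is restricted to two-dimensional data so that the determinant and inverse can be written out by hand, and the class parameters below are made-up values chosen only for illustration.<br />
<br />
```python
import math

def quadratic_discriminant(x, mu, sigma, prior):
    """delta_k(x) for a 2-D Gaussian class with mean mu, covariance sigma and prior."""
    (a, b), (c, d) = sigma
    det = a * d - b * c                                   # |Sigma_k|
    inv = [[d / det, -b / det], [-c / det, a / det]]      # Sigma_k^{-1}
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Mahalanobis term (x - mu)^T Sigma_k^{-1} (x - mu)
    maha = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)

def bayes_classify(x, classes):
    """Return the label k that maximises delta_k(x)."""
    return max(classes, key=lambda k: quadratic_discriminant(x, *classes[k]))

# Hypothetical parameters: label -> (mean, covariance, prior)
classes = {
    1: ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    2: ([3.0, 3.0], [[2.0, 0.5], [0.5, 1.0]], 0.5),
}
print(bayes_classify([0.5, 0.2], classes))
print(bayes_classify([2.8, 3.1], classes))
```
<br />
Setting every class covariance to the same matrix in <code>classes</code> reduces this rule to LDA, since the log-determinant term and the quadratic term in <math>x</math> are then the same for every class.<br />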
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When a common covariance matrix is assumed, it is estimated by the weighted average of the class sample covariances, which is the maximum likelihood estimate:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma_r}}{\sum_{l=1}^{k}n_l} </math><br />
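<br />
As a rough sketch of these estimates (in Python, standard library only), the function below computes the sample priors, class means, class covariances with the <math>1/n_k</math> normalisation used above, and the pooled covariance. The function name and the six toy points are assumptions made for illustration.<br />
<br />
```python
def estimate_parameters(points, labels):
    """Sample priors, class means, class covariances (divide by n_k) and pooled covariance."""
    n = len(points)
    d = len(points[0])
    priors, means, covs = {}, {}, {}
    for k in sorted(set(labels)):
        members = [p for p, y in zip(points, labels) if y == k]
        n_k = len(members)
        priors[k] = n_k / n
        means[k] = [sum(p[j] for p in members) / n_k for j in range(d)]
        covs[k] = [[sum((p[i] - means[k][i]) * (p[j] - means[k][j]) for p in members) / n_k
                    for j in range(d)] for i in range(d)]
    # Pooled covariance: sum_k n_k * Sigma_k / n, written with priors[k] = n_k / n
    pooled = [[sum(priors[k] * covs[k][i][j] for k in covs) for j in range(d)]
              for i in range(d)]
    return priors, means, covs, pooled

# Toy data: four points in class 1, two points in class 2
points = [[0, 0], [2, 0], [0, 2], [2, 2], [5, 5], [7, 5]]
labels = [1, 1, 1, 1, 2, 2]
priors, means, covs, pooled = estimate_parameters(points, labels)
print(priors, means, pooled)
```
<br />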
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data in each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class with the smallest adjusted distance, <math>\,\frac{1}{2}(x-\mu_k)^\top(x-\mu_k) - log(\pi_k)</math>, maximises <math>\,\delta_k</math>; when the priors are equal, this is simply the class with the nearest center. According to the theorem, we then assign the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
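<br />
A small Python sketch of Case 1 follows. It assumes identity covariances and uses two made-up class centers, and it shows how the log-prior adjustment can change the decision for a point that is slightly closer to the second center.<br />
<br />
```python
import math

def nearest_mean_classify(x, means, priors):
    """Pick argmax_k of -0.5 * ||x - mu_k||^2 + log(pi_k), i.e. Case 1 with Sigma_k = I."""
    def delta(k):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, means[k]))
        return -0.5 * sq_dist + math.log(priors[k])
    return max(means, key=delta)

# Hypothetical centers; x is slightly closer to the center of class 2
means = {1: [0.0, 0.0], 2: [3.0, 0.0]}
x = [1.6, 0.0]
label_equal_priors = nearest_mean_classify(x, means, {1: 0.5, 2: 0.5})
label_skewed_priors = nearest_mean_classify(x, means, {1: 0.9, 2: 0.1})
print(label_equal_priors, label_skewed_priors)
```
<br />
With equal priors the point goes to the nearer center (class 2); with priors of 0.9 and 0.1 the adjustment outweighs the small difference in distance and the point is assigned to class 1.<br />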
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, we can take <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is a covariance matrix, so it is symmetric and positive semi-definite.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
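<br />
The transformation can be checked numerically. The Python sketch below assumes a symmetric positive-definite 2x2 covariance (so that the eigen-decomposition has a closed form), builds <math>S^{-\frac{1}{2}}U^\top</math>, and compares the original quadratic term with the squared Euclidean distance after the transformation. The covariance, mean and point are made-up values.<br />
<br />
```python
import math

def whitening_matrix(sigma):
    """Return W = S^(-1/2) U^T for a symmetric positive-definite 2x2 matrix."""
    a, b, c = sigma[0][0], sigma[0][1], sigma[1][1]
    if abs(b) < 1e-12:
        # Already diagonal: the eigenvectors are the coordinate axes.
        pairs = [(a, [1.0, 0.0]), (c, [0.0, 1.0])]
    else:
        half_trace = (a + c) / 2.0
        radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
        # For eigenvalue lam, (lam - c, b) is an eigenvector of [[a, b], [b, c]].
        pairs = [(lam, [lam - c, b]) for lam in (half_trace + radius, half_trace - radius)]
    rows = []
    for lam, v in pairs:
        scale = math.hypot(v[0], v[1]) * math.sqrt(lam)   # normalise, then divide by sqrt(lam)
        rows.append([v[0] / scale, v[1] / scale])
    return rows

def transform(w, x):
    """Apply the 2x2 matrix w to the vector x."""
    return [w[0][0] * x[0] + w[0][1] * x[1], w[1][0] * x[0] + w[1][1] * x[1]]

def mahalanobis_sq(x, mu, sigma):
    """(x - mu)^T Sigma^{-1} (x - mu) using the explicit 2x2 inverse."""
    a, b, c = sigma[0][0], sigma[0][1], sigma[1][1]
    det = a * c - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

sigma = [[2.0, 0.5], [0.5, 1.0]]
mu, x = [3.0, 3.0], [1.0, 2.0]
w = whitening_matrix(sigma)
x_star, mu_star = transform(w, x), transform(w, mu)
euclid_sq = sum((p - q) ** 2 for p, q in zip(x_star, mu_star))
print(mahalanobis_sq(x, mu, sigma), euclid_sq)
```
<br />
The two printed numbers agree up to floating-point rounding, which is the identity derived above.<br />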
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Consider two classes with different shapes that we wish to transform to the same shape, and a data point whose class we must decide. Which transformation should we apply to it? If we use the transformation of class A, for example, then we have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because it does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distributions are Gaussian, which may not hold for real data. Another method, kernel QDA, does not rely on the Gaussian assumption and may work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
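<br />
The two counts are easy to tabulate. The short Python sketch below evaluates the formulas above; the function names are ours, and the choice of K = 2 classes is only an example.<br />
<br />
```python
def lda_parameter_count(K, d):
    """(K - 1) linear boundaries, each a^T x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_parameter_count(K, d):
    """(K - 1) quadratic boundaries, each with d(d + 3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)   # d(d + 3) is always even

for d in (2, 10, 64):
    print(d, lda_parameter_count(2, d), qda_parameter_count(2, d))
```
<br />
For two classes in 64 dimensions this gives 65 parameters for LDA against 2145 for QDA, which illustrates why QDA needs far more data to be estimated reliably.<br />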
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are non-linear functions, such as squares, of our original features. We then do LDA on our new higher-dimensional data. The decision boundary provided by LDA is linear in the new features, but when it is expressed in terms of the original features it is quadratic.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
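where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. Here <math>\,v</math> is taken to be a diagonal <math>d \times d</math> matrix with diagonal entries <math>\,v_1,\dots,v_d</math>, so that <math>x^Tvx = \sum_{i} v_i x_i^2</math>; cross terms <math>\,x_i x_j</math> would require additional features.<br />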
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the element-wise squares of the original matrix, to a cubic dimension with the element-wise cubes, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
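<br />
To make the trick concrete, here is a minimal Python sketch on made-up one-dimensional data in which one class sits between the two halves of the other, so no single threshold on <math>x</math> can separate them. Each point is extended to the features <math>(x, x^2)</math> and an ordinary two-feature LDA (pooled covariance, linear discriminant) is fitted. The helper names and the data are assumptions for illustration, not the course's 2_3 data.<br />
<br />
```python
import math

def lda_fit(features, labels):
    """Fit LDA on two features: return priors, class means and the inverse pooled covariance."""
    n = len(features)
    priors, means = {}, {}
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for k in sorted(set(labels)):
        rows = [f for f, y in zip(features, labels) if y == k]
        priors[k] = len(rows) / n
        means[k] = [sum(r[j] for r in rows) / len(rows) for j in range(2)]
        for r in rows:                      # accumulates sum_k n_k * Sigma_k / n
            for i in range(2):
                for j in range(2):
                    pooled[i][j] += (r[i] - means[k][i]) * (r[j] - means[k][j]) / n
    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    inv = [[pooled[1][1] / det, -pooled[0][1] / det],
           [-pooled[1][0] / det, pooled[0][0] / det]]
    return priors, means, inv

def lda_predict(z, priors, means, inv):
    """Return argmax_k of z^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log(pi_k)."""
    def delta(k):
        m = means[k]
        sm = [inv[0][0] * m[0] + inv[0][1] * m[1], inv[1][0] * m[0] + inv[1][1] * m[1]]
        return z[0] * sm[0] + z[1] * sm[1] - 0.5 * (m[0] * sm[0] + m[1] * sm[1]) + math.log(priors[k])
    return max(priors, key=delta)

xs = [-0.5, -0.2, 0.0, 0.3, 0.6, -3.0, -2.5, -2.0, 2.0, 2.4, 3.0]
labels = ["inner"] * 5 + ["outer"] * 6
features = [[v, v * v] for v in xs]          # x* = (x, x^2)
model = lda_fit(features, labels)
predictions = [lda_predict(f, *model) for f in features]
correct = sum(p == y for p, y in zip(predictions, labels))
print(correct, "of", len(xs), "training points classified correctly")
```
<br />
The boundary is a straight line in the <math>(x, x^2)</math> plane, but in terms of <math>x</math> alone it is a quadratic equation with two roots, which is what allows the inner class to be separated from the outer one.<br />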
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we extend <code>X_star</code> to six columns and set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points. Note that these counts are measured on the training data itself, so they may overstate performance on new data.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
{{Cleanup|date=September 2010|reason=Perhaps the first line should be plot (sample(1:200,1), sample(1:200,2), 'b.');?}} <br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it correctly classifies only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the numbers of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, Rows of <math>\,X</math> correspond to observations, columns to variables. When using princomp on 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U, up to the sign of each column, since singular vectors are only determined up to sign.<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points in individual classes close to each other while simultaneously far from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
>> library(MASS) # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
: Center the data and calculate its singular value decomposition. The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA, which requires the column means to be removed first).<br />
>> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper, "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA] ", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without their computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, with each class possibly having a different size. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, i.e. <math> \max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> \min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math><br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\,\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is a scalar. The second equality holds because <math>\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})</math> is itself a scalar, so the two factors can be swapped.<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>max (\underline{w}^T S_{B} \underline{w})</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math><br />
<br />
Thus, the covariance of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we get<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
Our goal: <math>min(\underline{w}^T S_{W} \underline{w})</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
Because this ratio does not change when <math>\underline{w}</math> is rescaled, the maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to the constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. <br />
The constraint is needed because, without it, <math>\ \underline w^T S_B \underline w</math> has no upper bound and <math>\ \underline w^T S_W \underline w</math> can be made arbitrarily close to zero simply by rescaling <math>\underline{w}</math>.<br />
<br />
We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math> where <math>\ \lambda </math> is the Lagrange multiplier<br />
<br />
<br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
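Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices, which are positive semi-definite. We assume that <math>\,S_{W}</math> is positive definite (which holds, for example, when at least one of <math>\,\Sigma_1</math> and <math>\,\Sigma_2</math> has full rank), so that it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{W}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue <math>\ \lambda </math>, since at a stationary point the objective <math>\underline{w}^T S_{B} \underline{w} = \lambda\ \underline{w}^T S_{W} \underline{w} = \lambda</math>.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
Dividing both sides by <math>\lambda</math> gives <math>\underline{w} = \frac{(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}}{\lambda}\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>, so <math>\underline{w}</math> is proportional to the quantity <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />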
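<br />
The closed-form direction can be tried out on a small example. The following is only a minimal sketch under stated assumptions: the data are two-dimensional and made up, the covariances are computed by dividing by the class size, and the helper names (<code>fisher_direction</code>, <code>fisher_ratio</code>) are hypothetical and not part of the lecture.<br />

```python
# Sketch: two-class Fisher direction w proportional to S_W^{-1} (mu_1 - mu_2)
# for 2-D data, using only the standard library. The data below are made up.

def mean(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(2)]

def covariance(points, mu):
    # 2x2 covariance of one class (dividing by n, like the class means)
    n = len(points)
    return [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / n
             for b in range(2)] for a in range(2)]

def fisher_direction(class1, class2):
    mu1, mu2 = mean(class1), mean(class2)
    s1, s2 = covariance(class1, mu1), covariance(class2, mu2)
    # Within-class covariance S_W = Sigma_1 + Sigma_2
    sw = [[s1[a][b] + s2[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("S_W is singular; the direction is not defined")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def fisher_ratio(w, class1, class2):
    # Objective (w^T S_B w) / (w^T S_W w), evaluated on the projected points
    z1 = [w[0] * p[0] + w[1] * p[1] for p in class1]
    z2 = [w[0] * p[0] + w[1] * p[1] for p in class2]
    m1, m2 = sum(z1) / len(z1), sum(z2) / len(z2)
    v1 = sum((z - m1) ** 2 for z in z1) / len(z1)
    v2 = sum((z - m2) ** 2 for z in z2) / len(z2)
    return (m1 - m2) ** 2 / (v1 + v2)

class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class2 = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]
w = fisher_direction(class1, class2)
print("w =", w, "ratio =", fisher_ratio(w, class1, class2))
```

Any rescaling of <code>w</code> gives the same ratio, which is why only the direction matters.<br />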
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, and as a preprocessing step for classification. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>c > 0</math>, so without loss of generality we can take <math>\textbf{w}</math> to have unit length,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle \textbf{w} </math>. The first principal component is the direction that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
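<br />
The identity <math>\textrm{var}(u) = \textbf{w}^T s \textbf{w}</math> and the bound by <math>\| s \|</math> can be checked numerically. This is a minimal sketch on a made-up two-dimensional sample; it assumes the sample covariance divides by <math>n-1</math> and uses the Frobenius norm for <math>\| s \|</math>, and the function names are hypothetical.<br />

```python
import math

data = [(2.0, 1.0), (3.0, 4.0), (5.0, 4.0), (6.0, 7.0)]  # made-up 2-D sample
n = len(data)
mean = [sum(x[j] for x in data) / n for j in range(2)]

# Sample covariance matrix s (dividing by n - 1)
s = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in data) / (n - 1)
      for b in range(2)] for a in range(2)]

def unit(w):
    norm = math.hypot(w[0], w[1])
    return (w[0] / norm, w[1] / norm)

def projected_variance(w):
    # Sample variance of u_i = w^T x_i for a unit-length w
    w = unit(w)
    u = [w[0] * x[0] + w[1] * x[1] for x in data]
    u_bar = sum(u) / n
    return sum((ui - u_bar) ** 2 for ui in u) / (n - 1)

def quadratic_form(w):
    # w^T s w for a unit-length w
    w = unit(w)
    return sum(w[a] * s[a][b] * w[b] for a in range(2) for b in range(2))

frobenius_norm = math.sqrt(sum(s[a][b] ** 2 for a in range(2) for b in range(2)))
for w in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]:
    print(w, projected_variance(w), quadratic_form(w), "<=", frobenius_norm)
```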
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that finding the direction of maximum variance amounts to solving <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the constrained maximum of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel or, equivalently, the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the largest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
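<br />
The example can be verified numerically. The sketch below is only an illustration: it evaluates the partial derivatives of <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> at the two stationary points, with the multipliers <math>\lambda = \pm\sqrt{2}/2</math> that solve the system, and then compares against a brute-force scan of the unit circle. The function names are hypothetical.<br />

```python
import math

def f(x, y):
    return x - y

def lagrangian_gradient(x, y, lam):
    # Partial derivatives of L(x, y, lam) = x - y - lam * (x^2 + y^2 - 1)
    return (1 - 2 * lam * x, -1 - 2 * lam * y, -(x ** 2 + y ** 2 - 1))

r = math.sqrt(2) / 2
# (x, y, lambda) triples: lambda = sqrt(2)/2 for the first point,
# lambda = -sqrt(2)/2 for the second
candidates = [(r, -r, r), (-r, r, -r)]

for x, y, lam in candidates:
    print((x, y), "gradient of L:", lagrangian_gradient(x, y, lam), "f =", f(x, y))

# The constrained maximum is the candidate with the largest value of f
best = max(candidates, key=lambda c: f(c[0], c[1]))

# Brute-force check: scan points (cos t, sin t) on the unit circle
scan_max = max(f(math.cos(2 * math.pi * k / 100000), math.sin(2 * math.pi * k / 100000))
               for k in range(100000))
print("best candidate:", best[:2], "scan maximum:", scan_max)
```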
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math>\textbf{w}</math> is a unit vector (i.e. <math> \textbf{w}^T \textbf{w} = 1</math>), then the second term of the Lagrangian is 0. <br />
<br />
If <math>\textbf{w}</math> is not a unit vector, the second term is non-zero. Requiring the derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\lambda</math> to vanish recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so the maximization takes place over unit vectors only.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
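<br />
For a 2 by 2 covariance matrix the eigenvalue problem can be solved in closed form, which makes the statements above easy to check. This is a minimal sketch, assuming a made-up symmetric matrix <code>S</code>; the helper names <code>pca_2x2</code> and <code>quad</code> are hypothetical.<br />

```python
import math

def pca_2x2(S):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if abs(b) > 1e-12:
        v = (b, lam1 - a)          # solves (S - lam1 I) v = 0 when b != 0
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    w1 = (v[0] / norm, v[1] / norm)
    w2 = (-w1[1], w1[0])           # orthogonal to w1, hence the other eigenvector
    return (lam1, lam2), (w1, w2)

def quad(S, w):
    # Variance along direction w: w^T S w
    return sum(w[i] * S[i][j] * w[j] for i in range(2) for j in range(2))

S = [[4.0, 2.0], [2.0, 3.0]]       # made-up sample covariance matrix
(lam1, lam2), (w1, w2) = pca_2x2(S)
print("Var(u_1) =", quad(S, w1), "lambda_1 =", lam1)
print("Var(u_2) =", quad(S, w2), "lambda_2 =", lam2)
print("lambda_1 + lambda_2 =", lam1 + lam2, "Tr(S) =", S[0][0] + S[1][1])
```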
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                 % load the data matrix X (one image per column)<br />
 who                                        % list the variables that were loaded<br />
 size(X)                                    % number of pixels by number of images<br />
 imagesc(reshape(X(:,1),20,28)')            % display the first noisy image<br />
 colormap gray<br />
 [u, s, v] = svd(X);                        % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 by 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
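<br />
The centering, encoding and reconstruction steps of PCA can be sketched as follows. This is only an illustration under stated assumptions: the data are made up and lie exactly on a line in three dimensions, a single component is kept (<math>d = 1</math>), each data point is stored as a list rather than as a matrix column, and the leading eigenvector is found by power iteration instead of a full decomposition. All function names are hypothetical.<br />

```python
import math

def center(points):
    # Subtract the mean so that the data have mean 0
    n, D = len(points), len(points[0])
    mean = [sum(x[j] for x in points) / n for j in range(D)]
    return mean, [[x[j] - mean[j] for j in range(D)] for x in points]

def top_component(centered, iterations=200):
    # Power iteration on S = (1/n) sum x x^T. It converges to the leading
    # eigenvector when the largest eigenvalue is strictly dominant and the
    # starting vector is not orthogonal to that eigenvector.
    n, D = len(centered), len(centered[0])
    S = [[sum(x[a] * x[b] for x in centered) / n for b in range(D)]
         for a in range(D)]
    u = [1.0] * D
    for _ in range(iterations):
        u = [sum(S[a][b] * u[b] for b in range(D)) for a in range(D)]
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]
    return u

def encode(centered, u):
    # y_i = u^T x_i: one coefficient per data point
    return [sum(uj * xj for uj, xj in zip(u, x)) for x in centered]

def reconstruct(codes, u, mean):
    # x_hat_i = mean + u * y_i
    return [[m + uj * y for m, uj in zip(mean, u)] for y in codes]

points = [[1.0 + t, 2.0 + 2.0 * t, 3.0 + 2.0 * t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
mean, centered = center(points)
u = top_component(centered)
codes = encode(centered, u)
x_hat = reconstruct(codes, u, mean)
print("u =", u)
print("codes =", codes)
```

Because these points lie exactly on a line, one coefficient per point is enough to reconstruct them; with noisy data the reconstruction would only be approximate.<br />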
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (for details, see [http://en.wikipedia.org/wiki/Principal_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6430stat841f102010-10-02T21:46:44Z<p>Hclam: /* Two-class Problem */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views. Some people call supervised learning "classification" and unsupervised learning "clustering", while others call clustering "unsupervised classification", as in http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups that can be found with a Google search. I think this should be clarified here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the <math>\,i</math>th training input's feature values <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math> form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to explain why we use the empirical error rate instead of the true error rate and why we define it. The main reason is that in experimental tasks we cannot measure the true error rate, so we estimate it by the empirical error rate, which is an unbiased estimate of the true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
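<br />
The empirical error rate is simply the fraction of misclassified training inputs, as the following minimal sketch shows. The classification rule, the diameters and the labels are all made up for illustration, and the name <code>empirical_error_rate</code> is hypothetical.<br />

```python
def empirical_error_rate(h, inputs, labels):
    # (1/n) * sum of I(h(X_i) != Y_i) over the training set
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(inputs)

# Hypothetical rule: call a fruit an orange when its diameter exceeds 7.5 cm
def h(x):
    return "orange" if x > 7.5 else "apple"

diameters = [6.8, 7.1, 7.9, 8.3, 7.4, 7.7]
labels = ["apple", "apple", "orange", "orange", "orange", "apple"]

# The rule misclassifies the last two fruits, so the rate is 2/6
rate = empirical_error_rate(h, diameters, labels)
print(rate)
```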
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
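<br />
The formula above can be sketched in a few lines of Python. This is only an illustration: the function name <code>posteriors</code> and the three-class likelihoods and priors are made-up assumptions, not values from the course.<br />
<br />
```python
def posteriors(likelihoods, priors):
    """Return P(Y=y | X=x) for every class y.

    likelihoods[y] = P(X=x | Y=y) and priors[y] = P(Y=y); both are
    illustrative inputs, not values taken from the notes.
    """
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())  # P(X=x), the denominator
    return {y: joint[y] / evidence for y in joint}


# Hypothetical three-class example.
post = posteriors({1: 0.10, 2: 0.40, 3: 0.25}, {1: 0.5, 2: 0.25, 3: 0.25})
best = max(post, key=post.get)  # class with the largest posterior
```
<br />
Dividing by the evidence only rescales the products of likelihood and prior so that they sum to one, so the most probable class can already be read off from the numerators.<br />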
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the classification rule <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. Since the two posterior probabilities sum to one, <math>\,P(Y=1|X=x)>\frac{1}{2}</math> holds exactly when <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, so in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the smallest possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate, i.e., <math>\,L(h) = P(h(X) \ne Y)</math>, and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
A rather detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the components of the Bayes classifier's model, such as the likelihoods and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate considerably from their true population values, which in turn can cause the estimated posterior probabilities to deviate considerably from their true values. <br />
{{Cleanup|date=October 2nd 2010|reason=It should be noted why other classifiers, such as neural networks, are needed if the Bayes classifier is optimal. The main reason is that, while this method is optimal and simple, it needs a lot of information to implement. We need the class-conditional distribution, which is rarely available and must be estimated. In real applications we usually assume a known distribution based on the problem, but in general it is not easy to obtain the distribution of the process. We also need the prior and marginal probabilities, which are simpler to obtain but add complexity to the problem. For this reason other classifiers have been developed. They are not optimal, but they can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
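<br />
Approach 3 can be sketched as follows for a single discrete feature, where the class-conditional probabilities are estimated by relative frequencies. The helper name <code>fit_plugin</code> and the toy data are assumptions made for illustration only.<br />
<br />
```python
from collections import Counter


def fit_plugin(xs, ys):
    """Estimate P(Y=1) and P(X=x | Y=y) by relative frequencies."""
    n = len(ys)
    prior1 = sum(ys) / n                      # (1/n) * sum of Y_i
    counts = {0: Counter(), 1: Counter()}
    for x, y in zip(xs, ys):
        counts[y][x] += 1
    totals = {y: sum(c.values()) for y, c in counts.items()}

    def r_hat(x):
        """Estimated posterior P(Y=1 | X=x); x must occur in the training data."""
        f1 = counts[1][x] / totals[1]
        f0 = counts[0][x] / totals[0]
        num = f1 * prior1
        return num / (num + f0 * (1 - prior1))

    return r_hat


# Made-up training data: one discrete feature, binary labels.
xs = ["a", "a", "b", "b", "b", "a", "b", "a"]
ys = [1, 1, 1, 0, 0, 0, 0, 1]
r_hat = fit_plugin(xs, ys)


def h(x):
    """Classification rule: 1 if the estimated posterior exceeds 1/2."""
    return 1 if r_hat(x) > 0.5 else 0
```
<br />
For continuous features the frequency counts would be replaced by a density estimate, for example the Gaussian model used by LDA and QDA.<br />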
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes, <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his/her feature values are <math>\, x = (G, M, H) </math> and his/her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
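<br />
The arithmetic of this example can be checked with a short script. The likelihood values 0.05 and 0.2 are the ones used in the calculation above; the variable names are illustrative.<br />
<br />
```python
# Likelihoods of x = (0, 1, 0) as used in the calculation above;
# the underlying table is not reproduced here.
lik_pass, lik_fail = 0.05, 0.2
prior_pass = prior_fail = 0.5

numerator = lik_pass * prior_pass
r_hat = numerator / (numerator + lik_fail * prior_fail)
label = 1 if r_hat > 0.5 else 0   # 1 = pass, 0 = fail
```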
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable the occurrence of E is before anything is known about any other event whose occurrence could have an impact on E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for E's occurrence. In practice, however, E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of E's prior probability, the Bayesian view holds that the probability, that is, the believability, of E's occurrence can always be made more accurate should any new information regarding events that are relevant to E become available. The more useful information there is about events relevant to E, the more accurate the estimate of the probability of E's occurrence becomes. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number of, well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, assign a probability to rain tomorrow, because one cannot possibly carry out trials on an event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
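<br />
The long-run relative frequency can be illustrated with a small simulation. This is a sketch under assumptions: the event probability 0.3, the trial counts, and the function name <code>relative_frequency</code> are all chosen arbitrarily for illustration.<br />
<br />
```python
import random


def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials independent trials of an event with probability p
    and return the observed relative frequency n_x / n_t."""
    rng = random.Random(seed)
    n_x = sum(rng.random() < p for _ in range(n_trials))
    return n_x / n_trials


# The estimate should settle near the assumed intrinsic probability 0.3
# as the number of trials grows.
estimates = {n: relative_frequency(0.3, n) for n in (100, 10_000, 200_000)}
```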
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the number of features. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case under the assumptions of LDA. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
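<br />
As a rough numerical check of the boundary equation, the following sketch computes ''a'' and ''b'' for a two-dimensional example. The means, covariance matrix and priors are hypothetical, and the small 2x2 helpers are written out only to keep the example self-contained.<br />
<br />
```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the decision boundary is a^T x + b = 0."""
    s_inv = inv2(sigma)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    a = [-2 * c for c in matvec(s_inv, diff)]
    b = (-2 * math.log(pi1 / pi0)
         + dot(mu1, matvec(s_inv, mu1))
         - dot(mu0, matvec(s_inv, mu0)))
    return a, b


# Hypothetical parameters with equal priors.
mu1, mu0 = [2.0, 1.0], [0.0, -1.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
a, b = lda_boundary(mu1, mu0, sigma, 0.5, 0.5)
midpoint = [(mu1[0] + mu0[0]) / 2, (mu1[1] + mu0[1]) / 2]
on_boundary = abs(dot(a, midpoint) + b) < 1e-9
```
<br />
Because the equation was multiplied by -2 in the last step of the derivation, a negative value of <math>\,a^\top x+b</math> corresponds to class 1 being the more probable class.<br />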
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-class case can easily be generalized to the case where there are <math>\,k \ge 2</math> classes. To find the Bayes classifier's decision boundary between any two classes <math>\,m </math> and <math>\,n</math>, we follow the derivation shown above with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. This gives the decision boundary between classes <math>\,m </math> and <math>\,n</math> as <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. In the special case where the two classes have equal priors <math>\,\pi_m = \pi_n</math> (for instance, when both classes have the same number of training data points and the priors are estimated from the class frequencies), the log term vanishes and the resulting decision boundary passes through the midpoint of the two class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found at [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions under LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that calculating the class posteriors <math>\Pr(Y=k|X=x)</math> gives optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math> the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic discriminant analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance matrices of the classes are identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
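<br />
The quadratic discriminant and the arg max rule can be sketched for two-dimensional data as follows. The class parameters are hypothetical, and the function names <code>quad_delta</code> and <code>classify</code> are illustrative (the latter is unrelated to the MATLAB function of the same name).<br />
<br />
```python
import math


def quad_delta(x, mu, sigma, prior):
    """Quadratic discriminant for one class with a 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    z = [x[0] - mu[0], x[1] - mu[1]]
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu)
    maha = (z[0] * (inv[0][0] * z[0] + inv[0][1] * z[1])
            + z[1] * (inv[1][0] * z[0] + inv[1][1] * z[1]))
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)


def classify(x, params):
    """params maps class label -> (mu, sigma, prior); returns the arg max."""
    return max(params, key=lambda k: quad_delta(x, *params[k]))


# Hypothetical two-class parameters with different covariance matrices.
params = {
    1: ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    2: ([3.0, 3.0], [[2.0, 0.5], [0.5, 1.0]], 0.5),
}
```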
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
The common covariance matrix is estimated by the weighted average of the class sample covariance matrices. <br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
<br />
This is a Maximum Likelihood estimate.<br />
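<br />
These estimates can be computed directly from labelled data. The sketch below uses plain Python lists and a made-up data set; the function names are illustrative assumptions.<br />
<br />
```python
def class_estimates(xs, ys):
    """Return {k: (prior, mean, covariance)} using the 1/n_k (ML) covariance."""
    n, d = len(xs), len(xs[0])
    out = {}
    for k in sorted(set(ys)):
        pts = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(pts)
        mu = [sum(p[j] for p in pts) / n_k for j in range(d)]
        cov = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in pts) / n_k
                for j in range(d)] for i in range(d)]
        out[k] = (n_k / n, mu, cov)
    return out


def pooled_covariance(est, n):
    """Weighted average of the class covariances with weights n_k = prior * n."""
    d = len(next(iter(est.values()))[1])
    return [[sum(prior * n * cov[i][j] for prior, _, cov in est.values()) / n
             for j in range(d)] for i in range(d)]


# Made-up data: four points in class 1, two points in class 2.
xs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [5.0, 5.0], [7.0, 5.0]]
ys = [1, 1, 1, 1, 2, 2]
est = class_estimates(xs, ys)
sigma_hat = pooled_covariance(est, len(xs))
```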
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class that has the minimum adjusted distance <math>\,\frac{1}{2}(x-\mu_k)^\top(x-\mu_k) - \log(\pi_k)</math> will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
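<br />
A minimal sketch of Case 1 shows how the log prior can change the decision for a point that is only slightly closer to one center. The means, priors and test point are hypothetical.<br />
<br />
```python
import math


def delta_identity(x, mu, prior):
    """Discriminant when Sigma_k = I: -0.5 * squared distance + log prior."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * sq_dist + math.log(prior)


# Hypothetical class means; x is slightly closer to the mean of class 2.
means = {1: [0.0, 0.0], 2: [2.0, 0.0]}
x = [1.1, 0.0]

# With equal priors the nearest center wins.
equal = max(means, key=lambda k: delta_identity(x, means[k], 0.5))

# With a strong prior for class 1 the decision flips.
priors = {1: 0.9, 2: 0.1}
skewed = max(means, key=lambda k: delta_identity(x, means[k], priors[k]))
```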
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is a covariance matrix, so it is symmetric and positive semi-definite.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
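<br />
The transformation can be checked numerically for a 2x2 covariance matrix, for which the eigendecomposition has a closed form. This is a sketch under assumptions: the matrix and the points are made up, and <code>whitening_map</code> is a hypothetical helper that handles only the symmetric 2x2 case.<br />
<br />
```python
import math


def whitening_map(sigma):
    """Return a function x -> S^(-1/2) U^T x for a symmetric 2x2 sigma."""
    a, b, c = sigma[0][0], sigma[0][1], sigma[1][1]
    theta = 0.5 * math.atan2(2 * b, a - c)   # rotation that diagonalises sigma
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_gap = math.hypot((a - c) / 2, b)
    lam1 = (a + c) / 2 + half_gap            # eigenvalue for (cos, sin)
    lam2 = (a + c) / 2 - half_gap            # eigenvalue for (-sin, cos)

    def transform(x):
        u1 = cos_t * x[0] + sin_t * x[1]     # first component of U^T x
        u2 = -sin_t * x[0] + cos_t * x[1]    # second component of U^T x
        return [u1 / math.sqrt(lam1), u2 / math.sqrt(lam2)]

    return transform


# Hypothetical covariance matrix, data point and class mean.
sigma = [[2.0, 0.5], [0.5, 1.0]]
to_star = whitening_map(sigma)
x, mu = [1.0, 2.0], [0.0, -1.0]
x_star, mu_star = to_star(x), to_star(mu)

# Squared Euclidean distance in the transformed space.
sq_euclid = sum((p - q) ** 2 for p, q in zip(x_star, mu_star))
```
<br />
In the transformed space the squared Euclidean distance agrees with <math>\,(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)</math> computed in the original space, which is what allows Case 1 to be reused.<br />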
<br />
Note that when we have multiple classes, they must all have the same transformation; otherwise, we would have to assume ahead of time that a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Consider two classes with different shapes, and suppose we want to transform them to the same shape. Given a data point, we must decide which class it belongs to. The question is, which transformation can we use? For example, if we use the transformation of class A, then we have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not be the case in practice. Another method, kernel QDA, does not make the Gaussian distribution assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
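<br />
The two counts can be compared for a few values of <math>\,d</math> with a short script. The function names are illustrative.<br />
<br />
```python
def lda_param_count(K, d):
    """(K-1) linear boundaries, each a^T x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)


def qda_param_count(K, d):
    """(K-1) quadratic boundaries, each with d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)


# Parameter counts for K = 2 classes as the dimension grows.
table = [(d, lda_param_count(2, d), qda_param_count(2, d)) for d in (2, 10, 50)]
```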
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that contain functions of our original features, such as their squares. We then do LDA on our new higher-dimensional data. The linear boundary provided by LDA in the higher-dimensional space can then be expressed in terms of the original features, giving us a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>v</math> is a <math>d \times d</math> matrix, that we cannot estimate directly with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
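We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>. Since <math>x^*</math> contains the squares <math>x_i^2</math> but no cross terms <math>x_ix_j</math>, this corresponds to restricting <math>v</math> to a diagonal matrix with entries <math>v_1,\ldots,v_d</math>.<br />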
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix containing the element-wise squares of the original entries, with cubic features by appending the element-wise cubes, or even with a different function altogether, such as <math>\,\sin(x)</math>.<br />
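<br />
The following sketch (in Python, with made-up coefficients) illustrates the idea: a function that is linear in the expanded vector <math>x^*</math> is the same as a function that is quadratic in the original <math>x</math>. The helper names <code>expand</code> and <code>linear</code> are hypothetical and are not part of any library.<br />

```python
def expand(x):
    """Map [x_1, ..., x_d] to [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]


def linear(w, x):
    """Evaluate the linear function w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical coefficients, chosen only for illustration.
w = [1.0, -2.0]        # coefficients of the linear terms
v = [0.5, 3.0]         # coefficients of the squared terms
w_star = w + v         # w* = [w_1, ..., w_d, v_1, ..., v_d]

x = [2.0, -1.0]
quadratic_in_x = linear(w, x) + sum(vi * xi ** 2 for vi, xi in zip(v, x))
linear_in_x_star = linear(w_star, expand(x))

# Both expressions evaluate the same function.
print(quadratic_in_x, linear_in_x_star)
```

Any linear classifier trained on the expanded features therefore produces a boundary that is quadratic in the original features, without any change to the classifier itself.<br />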
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
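<br />
:The same calculation can be written out explicitly. The sketch below is in Python and uses made-up label lists that merely stand in for <code>class</code> and <code>group</code>; it is not the output of <code>classify</code>.<br />

```python
def empirical_error_rate(predicted, actual):
    """Fraction of points whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


# Made-up labels: 400 points, of which 31 are deliberately mislabeled.
actual = [1] * 200 + [2] * 200
predicted = list(actual)
for i in range(31):
    predicted[i] = 2   # flip 31 labels from class 1 to class 2

print(empirical_error_rate(predicted, actual))
```

:With 31 errors out of 400 points, the rate is 31/400 = 0.0775, matching the figure above.<br />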
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA (371 versus 369); we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the <code>princomp</code> function in Matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] that performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the plain SVD method. The following is the code of <code>princomp</code>, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (possibly up to the sign of individual columns).<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points within each class close to each other while simultaneously keeping them far from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
>> library(MASS)  # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
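: Calculate the singular value decomposition of the centered data. The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA). Centering matters here: without it, the leading singular vector is pulled toward the mean of the data rather than the direction of greatest variance.<br />
>> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />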
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without their computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, and the individual classes may have different sizes. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, i.e. <math> \max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> \min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math><br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is scalar<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>max (\underline{w}^T S_{B} \underline{w})</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math><br />
<br />
Thus, the covariance of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we get<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
Our goal: <math>min(\underline{w}^T S_{W} \underline{w})</math><br />
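<br />
As a minimal sketch of the two goals, the Python code below computes <math>S_B</math> and <math>S_W</math> for a tiny made-up two-dimensional data set and evaluates both quadratic forms for two candidate directions. The data and the helper names are assumptions made for illustration, and the sample covariance (dividing by <math>n-1</math>) is used for <math>\Sigma_1</math> and <math>\Sigma_2</math>.<br />

```python
def mean(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def covariance(points):
    """Sample covariance matrix (divides by n - 1)."""
    mu = mean(points)
    n, d = len(points), len(mu)
    return [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / (n - 1)
             for b in range(d)] for a in range(d)]


def quad_form(w, M):
    """Evaluate w^T M w."""
    d = len(w)
    return sum(w[a] * M[a][b] * w[b] for a in range(d) for b in range(d))


# Tiny made-up data set, for illustration only.
class1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 2.0]]
class2 = [[6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [7.0, 6.0]]

mu1, mu2 = mean(class1), mean(class2)
diff = [a - b for a, b in zip(mu1, mu2)]

# Between class covariance: (mu1 - mu2)(mu1 - mu2)^T
S_B = [[diff[a] * diff[b] for b in range(2)] for a in range(2)]

# Within class covariance: Sigma_1 + Sigma_2
S1, S2 = covariance(class1), covariance(class2)
S_W = [[S1[a][b] + S2[a][b] for b in range(2)] for a in range(2)]

# A good direction makes the first number large and the second small.
for w in ([1.0, 0.0], [0.0, 1.0]):
    print(w, quad_form(w, S_B), quad_form(w, S_W))
```

Note that <math>\underline{w}^T S_B \underline{w}</math> is exactly the squared distance between the projected means, so it can be checked directly against <math>(\underline{w}^T\underline{\mu_1} - \underline{w}^T\underline{\mu_2})^2</math>.<br />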
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization and data mining, and it is often used as a preprocessing step for classification. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keeping two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
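<br />
The linear combination above can be sketched in a few lines of Python. The numbers are toy stand-ins (4 "pixels" instead of 644), and the function name <code>reconstruct</code> is hypothetical.<br />

```python
def reconstruct(mean, components, coeffs):
    """Return mean + sum_k coeffs[k] * components[k]."""
    approx = list(mean)
    for lam, v in zip(coeffs, components):
        for j in range(len(approx)):
            approx[j] += lam * v[j]
    return approx


# Toy stand-ins with made-up numbers: 4 "pixels" instead of 644.
x_bar = [0.5, 0.5, 0.5, 0.5]       # mean image
v1 = [0.5, 0.5, -0.5, -0.5]        # first direction (unit length)
v2 = [0.5, -0.5, 0.5, -0.5]        # second direction, orthogonal to v1
lam = [0.2, -0.4]                  # one column of W

print(reconstruct(x_bar, [v1, v2], lam))
```

Each image is thus stored as the two numbers in <code>lam</code>, while <code>x_bar</code>, <code>v1</code> and <code>v2</code> are shared by the whole data set.<br />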
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>c > 0</math>, so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> determined by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
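<br />
Before turning to Lagrange multipliers, the constrained problem can be explored numerically in two dimensions, where every unit vector has the form <math>(\cos t, \sin t)</math>. The Python sketch below scans these directions for a made-up covariance matrix and checks the bound above using the Frobenius norm as one possible choice of <math>\| s \|</math>. This brute-force scan is only an illustration and is not the method used in the derivation.<br />

```python
import math

# Hypothetical 2 x 2 sample covariance matrix, for illustration only.
s = [[1.0, 1.5],
     [1.5, 3.0]]


def projected_variance(w, s):
    """Evaluate w^T s w for a 2-dimensional w."""
    return sum(w[a] * s[a][b] * w[b] for a in range(2) for b in range(2))


def best_direction(s, steps=20000):
    """Scan unit vectors (cos t, sin t) and keep the one maximizing w^T s w."""
    best_w, best_var = None, -math.inf
    for i in range(steps):
        t = math.pi * i / steps     # w and -w give the same variance
        w = (math.cos(t), math.sin(t))
        var = projected_variance(w, s)
        if var > best_var:
            best_w, best_var = w, var
    return best_w, best_var


w, var = best_direction(s)
frobenius = math.sqrt(sum(x * x for row in s for x in row))

# The maximal variance never exceeds the norm of s.
print(w, var, frobenius)
```

The scan confirms that the variance attains a maximum on the unit circle and that this maximum stays below the norm of <math>s</math>.<br />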
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To find out which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
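<br />
The example can be checked numerically. The following is a minimal Python sketch, not part of the lecture material: the function names are illustrative assumptions, the stationary points are obtained from the Lagrangian <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> in closed form, and a brute-force search over the unit circle is used only as a sanity check.<br />

```python
import math

def f(x, y):
    """Objective from the example: f(x, y) = x - y."""
    return x - y

def stationary_points():
    """Stationary points of L = x - y - lam*(x^2 + y^2 - 1).

    Setting dL/dx = 1 - 2*lam*x and dL/dy = -1 - 2*lam*y to zero gives
    x = 1/(2*lam) and y = -1/(2*lam); the constraint then forces lam^2 = 1/2.
    """
    points = []
    for lam in (math.sqrt(2) / 2, -math.sqrt(2) / 2):
        x = 1 / (2 * lam)
        y = -1 / (2 * lam)
        points.append((x, y, lam))
    return points

def grid_maximum(steps=3600):
    """Sanity check: evaluate f on evenly spaced points of the unit circle."""
    best = max((2 * math.pi * k / steps for k in range(steps)),
               key=lambda t: f(math.cos(t), math.sin(t)))
    return math.cos(best), math.sin(best)

if __name__ == "__main__":
    for x, y, lam in stationary_points():
        print(f"x={x:.4f}, y={y:.4f}, lambda={lam:.4f}, f={f(x, y):.4f}")
    print("grid maximum near", grid_maximum())
```

Both approaches should agree on the maximizer, which illustrates that the Lagrangian only identifies candidate points; the objective still has to be evaluated at each candidate.<br />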
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is nonzero; for <math>\lambda > 0</math> and <math> \textbf{w}^T \textbf{w} > 1 </math> it decreases the overall <math>\displaystyle L(\textbf{w},\lambda)</math>. The constrained maximum occurs at a point where <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the <math>2S\textbf{w}</math> term below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
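<br />
The eigenvalue result above can be illustrated on a small matrix. The following is a minimal Python sketch under stated assumptions: the 2×2 covariance matrix <code>S</code> is hypothetical (not the lecture data), and power iteration is used here only because it needs no external library; the class example uses an SVD instead.<br />

```python
import math

def mat_vec(S, w):
    """Matrix-vector product S w for a square matrix stored as nested lists."""
    return [sum(S[i][j] * w[j] for j in range(len(w))) for i in range(len(S))]

def normalize(w):
    """Rescale w so that w^T w = 1 (the constraint in the optimization)."""
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]

def top_eigenpair(S, iterations=500):
    """Power iteration: approximate the largest eigenvalue of S and its unit eigenvector."""
    w = normalize([1.0] * len(S))
    for _ in range(iterations):
        w = normalize(mat_vec(S, w))
    lam = sum(wi * si for wi, si in zip(w, mat_vec(S, w)))  # w^T S w
    return lam, w

# Hypothetical 2x2 covariance matrix (not taken from the lecture data).
S = [[3.0, 1.0],
     [1.0, 2.0]]

lam1, u1 = top_eigenpair(S)
trace = S[0][0] + S[1][1]
lam2 = trace - lam1  # for a 2x2 matrix the two eigenvalues sum to Tr(S)

print("largest eigenvalue (variance along u1):", lam1)
print("first principal direction u1:", u1)
print("second eigenvalue:", lam2)
```

The value of <math>\textbf{w}^T S \textbf{w}</math> at the returned unit vector equals the largest eigenvalue, and the two eigenvalues add up to the trace of <code>S</code>, matching the variance decomposition above.<br />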
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                % loads the data matrix X (one image per column)
 who                                       % list the variables now in the workspace
 size(X)                                   % check the dimensions of X
 idx = 1;                                  % column (image) to display; other valid columns work too, e.g. 105, 500, 1000
 imagesc(reshape(X(:,idx),20,28)')         % show the noisy image
 colormap gray
 [u s v] = svd(X);
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components
 figure
 imagesc(reshape(xHat(:,idx),20,28)')      % show the reconstruction of the same image
 colormap gray
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.
<br />
The second one is the de-noised image.
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y_{d \times n} = U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X}_{D \times n} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6429stat841f102010-10-02T21:44:33Z<p>Hclam: /* Two-class Problem */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>-th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>-th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input with feature values <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
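<br />
The empirical error rate is straightforward to compute. The following is a minimal Python sketch; the threshold rule <code>h</code> and the five training pairs are hypothetical values chosen only for illustration.<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(X) != len(Y) or not X:
        raise ValueError("X and Y must be non-empty and of equal length")
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)

def h(x):
    """Hypothetical classification rule: threshold a single feature at 0.5."""
    return 1 if x > 0.5 else 0

# Hypothetical training set with one feature per input.
X = [0.1, 0.4, 0.6, 0.9, 0.3]
Y = [0, 1, 1, 1, 0]

print(empirical_error_rate(h, X, Y))  # one of five inputs is misclassified: 0.2
```

Only the second input (feature 0.4, label 1) is misclassified by this rule, so the indicator sum is 1 and the rate is 1/5.<br />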
<br />
=== Bayes Classifier ===<br />
<br />
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
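<br />
The two-class rule can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are invented for this sketch, and the likelihood and prior values are hypothetical numbers, not estimates from any data set in these notes.<br />

```python
def posterior_class1(lik1, lik0, prior1):
    """r(x) = P(Y=1|X=x) from the two likelihoods P(X=x|Y=1), P(X=x|Y=0) and P(Y=1)."""
    prior0 = 1.0 - prior1
    evidence = lik1 * prior1 + lik0 * prior0  # P(X=x)
    if evidence == 0:
        raise ValueError("P(X=x) is zero, so the posterior is undefined")
    return lik1 * prior1 / evidence

def bayes_rule(r):
    """h*(x): assign class 1 when r(x) > 1/2, otherwise class 0."""
    return 1 if r > 0.5 else 0

# Hypothetical values: P(X=x|Y=1) = 0.3, P(X=x|Y=0) = 0.1, P(Y=1) = 0.4.
r = posterior_class1(0.3, 0.1, 0.4)
print(r, bayes_rule(r))
```

Comparing <math>\,r(x)</math> with 1/2 is equivalent to comparing the two posterior probabilities directly, because in the two-class case they sum to 1.<br />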
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the smallest possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the components of the Bayes classifier's model, such as the likelihoods and prior probabilities, must either be estimated from training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate considerably from their true population values, and this ultimately can cause the posterior probabilities to deviate considerably from their true population values as well. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
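<br />
Approach 3 can be sketched for discrete features as follows. This is a minimal Python illustration, not the method used in the lecture: the class-conditional probabilities are estimated by relative frequencies of whole feature vectors, the function names and the six training pairs are hypothetical, and returning 0.5 for a feature vector never seen in training is an assumption of this sketch.<br />

```python
from collections import Counter

def fit_density_classifier(X, Y):
    """Plug-in classifier for discrete features and labels in {0, 1}.

    Estimates P(X=x|Y=y) by relative frequency within class y and
    P(Y=1) by the mean of the labels, then thresholds r_hat(x) at 1/2.
    """
    n = len(Y)
    prior1 = sum(Y) / n                      # estimate of P(Y=1)
    counts = {0: Counter(), 1: Counter()}    # counts of each feature vector per class
    totals = {0: 0, 1: 0}
    for x, y in zip(X, Y):
        counts[y][tuple(x)] += 1
        totals[y] += 1

    def r_hat(x):
        x = tuple(x)
        lik1 = counts[1][x] / totals[1] if totals[1] else 0.0
        lik0 = counts[0][x] / totals[0] if totals[0] else 0.0
        evidence = lik1 * prior1 + lik0 * (1 - prior1)
        # Assumption of this sketch: fall back to 0.5 for unseen feature vectors.
        return lik1 * prior1 / evidence if evidence > 0 else 0.5

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Hypothetical training data with two binary features.
X = [(1, 1), (1, 0), (1, 1), (1, 1), (0, 0), (0, 0)]
Y = [1, 1, 1, 0, 0, 0]
r_hat, h = fit_density_classifier(X, Y)
print(r_hat((1, 1)), h((1, 1)))
print(r_hat((0, 0)), h((0, 0)))
```

With continuous features the relative frequencies would be replaced by a density estimate, such as the Gaussian densities assumed by LDA and QDA.<br />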
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
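<br />
The multi-class rule amounts to normalizing the products <math>\,f_i(x)\pi_i</math> and picking the largest. The following is a minimal Python sketch; the function name and the three-class likelihood and prior values are hypothetical and serve only as an illustration.<br />

```python
def bayes_classify(likelihoods, priors):
    """Return (label, posteriors) where label = argmax_i P(Y=i|X=x).

    likelihoods[i] is f_i(x) = P(X=x|Y=i) and priors[i] is pi_i = P(Y=i).
    """
    joint = {i: likelihoods[i] * priors[i] for i in priors}
    evidence = sum(joint.values())  # P(X=x)
    if evidence <= 0:
        raise ValueError("P(X=x) must be positive")
    posteriors = {i: v / evidence for i, v in joint.items()}
    label = max(posteriors, key=posteriors.get)
    return label, posteriors

# Hypothetical three-class problem evaluated at a single input x.
priors = {1: 0.5, 2: 0.3, 3: 0.2}
likelihoods = {1: 0.1, 2: 0.4, 3: 0.3}
label, posteriors = bayes_classify(likelihoods, priors)
print(label, posteriors)
```

Because the evidence is the same for every class, the same label is obtained by maximizing <math>\,f_i(x)\pi_i</math> without normalizing.<br />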
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, the feature vector is <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable the occurrence of event E is before anything is known about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
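<br />
The frequentist approximation <math>P(x) \approx \frac{n_x}{n_t}</math> can be illustrated by simulation. The following is a minimal Python sketch; the event probability 0.3, the trial counts, and the function name are hypothetical choices for illustration, and a pseudo-random generator stands in for real independent trials.<br />

```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials independent trials of an event with probability p; return n_x / n_t."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p)
    return n_x / n_trials

# Hypothetical event with intrinsic probability 0.3.
for n_t in (10, 1000, 100000):
    print(n_t, relative_frequency(0.3, n_t))
```

As the number of trials grows, the printed relative frequencies should tend to settle near 0.3, although any single finite run only approximates it.<br />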
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] under the assumptions of LDA) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by assuming that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
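<br />
As a minimal sketch of how the boundary could be evaluated numerically, the following code computes ''a'' and ''b'' for two Gaussian classes. The parameter values and the helper name <code>classify_point</code> are made up for illustration; only base Matlab functions are used.<br />
<br />
```matlab
% Illustrative (made-up) parameters for two Gaussian classes sharing Sigma.
mu0   = [0; 0];
mu1   = [2; 1];
Sigma = [1 0.5; 0.5 2];
pi0   = 0.5;
pi1   = 0.5;

% Coefficients of the decision boundary a'*x + b = 0 derived above.
a = -2 * (Sigma \ (mu1 - mu0));
b = -2 * log(pi1 / pi0) + mu1' * (Sigma \ mu1) - mu0' * (Sigma \ mu0);

% The boundary equation was obtained by multiplying log(f_1*pi_1) - log(f_0*pi_0)
% by -2, so a'*x + b < 0 means that class 1 has the larger posterior.
classify_point = @(x) double(a' * x + b < 0);   % hypothetical helper: returns 0 or 1

fprintf('label at mu1: %d, label at mu0: %d\n', classify_point(mu1), classify_point(mu0));
```
<br />
Using <code>Sigma \ v</code> instead of <code>inv(Sigma) * v</code> avoids forming the inverse explicitly, which is the usual choice in Matlab.<br />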
<br />
<br />
LDA in the two-class case can easily be generalized to the case where there are <math>\,k \ge 2</math> classes. Suppose we wish to find the Bayes classifier's decision boundary between two classes <math>\,m </math> and <math>\,n</math>; then all we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. One obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. In the special case where classes <math>\,m </math> and <math>\,n</math> have the same number of data points, so that their estimated priors <math>\,\pi_m </math> and <math>\,\pi_n</math> are equal, the resulting decision boundary passes through the point exactly halfway between the means of <math>\,m </math> and <math>\,n</math>.<br />
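<br />
As a short check of the special case (a sketch using only the boundary equation above): when the priors are equal, <math>\,\pi_m=\pi_n</math>, we have <math>\,\log(\frac{\pi_m}{\pi_n})=0</math>, and substituting the midpoint of the two class means, <math>\,x=\frac{1}{2}(\mu_m+\mu_n)</math>, into the left-hand side of the boundary equation gives<br />
<br />
:<math>\,\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - (\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n) = \mu_m^\top\Sigma^{-1}\mu_n - \mu_n^\top\Sigma^{-1}\mu_m = 0</math><br />
<br />
where the last step uses the symmetry of <math>\,\Sigma^{-1}</math>. The midpoint of the means therefore lies on the decision boundary whenever the priors are equal.<br />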
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found at [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \ \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features that best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic discriminant analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance matrices of the classes are identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian for every class <math>\,k</math>, then the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
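<br />
The rule in the theorem can be sketched directly in code. The class parameters below are made up for illustration, and <code>delta</code> is a hypothetical helper that evaluates the quadratic discriminant for class <code>k</code>.<br />
<br />
```matlab
% Made-up parameters for K = 2 Gaussian classes in d = 2 dimensions.
mu    = {[0; 0], [3; 3]};
Sigma = {eye(2), [4 0; 0 4]};
prior = [0.5, 0.5];

% Quadratic discriminant delta_k(x) from the theorem above.
delta = @(x, k) -0.5 * log(det(Sigma{k})) ...
                - 0.5 * (x - mu{k})' * (Sigma{k} \ (x - mu{k})) ...
                + log(prior(k));

x      = [1; 1];
scores = [delta(x, 1), delta(x, 2)];
[best_score, h] = max(scores);      % h is the arg max over k of delta_k(x)

fprintf('delta_1 = %.4f, delta_2 = %.4f, h(x) = %d\n', scores(1), scores(2), h);
```
<br />
Here the point is closer to the first mean in the sense of the quadratic form, so the rule assigns it to class 1.<br />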
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
Under LDA, the common covariance matrix is estimated by the average of the class sample covariances, weighted by the class sizes:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{K}n_r\hat{\Sigma}_r}{\sum_{l=1}^{K}n_l} </math><br />
<br />
This is the maximum likelihood estimate of the common covariance matrix.<br />
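<br />
The estimates above can be sketched in a few lines of code. The tiny data set is made up for illustration; rows of <code>X</code> are observations and <code>y</code> holds the class labels.<br />
<br />
```matlab
% Tiny made-up training set: rows are observations, y holds class labels.
X = [1 2; 2 1; 3 3; 6 5; 7 7; 8 6; 9 8];
y = [1; 1; 1; 2; 2; 2; 2];

K      = max(y);
[n, d] = size(X);
pi_hat       = zeros(K, 1);
mu_hat       = zeros(K, d);
Sigma_hat    = zeros(d, d, K);
Sigma_pooled = zeros(d, d);

for k = 1:K
    Xk  = X(y == k, :);                         % observations of class k
    n_k = size(Xk, 1);
    pi_hat(k)    = n_k / n;                     % estimated prior
    mu_hat(k, :) = mean(Xk, 1);                 % estimated mean
    Xc = Xk - repmat(mu_hat(k, :), n_k, 1);     % centre the class-k rows
    Sigma_hat(:, :, k) = (Xc' * Xc) / n_k;      % ML covariance (divides by n_k)
    Sigma_pooled = Sigma_pooled + n_k * Sigma_hat(:, :, k);
end
Sigma_pooled = Sigma_pooled / n;                % weighted average of the Sigma_k

disp(pi_hat');
disp(mu_hat);
disp(Sigma_pooled);
```
<br />
Note that Matlab's built-in <code>cov</code> divides by <math>\,n_k-1</math> by default, so it would give slightly different values from the maximum likelihood formula used here.<br />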
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,-\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, to classify a point we can find the squared distance between the point and each class center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. When the priors are equal, the class whose center is nearest to the point maximises <math>\,\delta_k</math>; otherwise the log-prior term can shift the decision. According to the theorem, we then assign the point to the class <math>\,k</math> with the largest <math>\,\delta_k</math>.<br />
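<br />
A small numerical sketch shows how the prior adjustment can matter in this case. The means, priors, and test point are made up for illustration.<br />
<br />
```matlab
mu    = [0 0; 2 0];      % one class mean per row (made-up values)
prior = [0.9; 0.1];      % unequal priors
x     = [1.2 0];         % point to classify

d2    = sum((mu - repmat(x, size(mu, 1), 1)).^2, 2);   % squared Euclidean distances
delta = -0.5 * d2 + log(prior);                        % delta_k with Sigma_k = I

[min_d2, nearest] = min(d2);       % class with the nearest mean
[max_delta, h]    = max(delta);    % class chosen by the discriminant rule

fprintf('nearest mean: class %d, discriminant rule: class %d\n', nearest, h);
```
<br />
In this example the point is nearer to the mean of class 2, but the much larger prior of class 1 makes <math>\,\delta_1</math> the larger of the two, so the point is assigned to class 1.<br />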
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. So if <math>\, X</math> is symmetric and positive semidefinite, as a covariance matrix <math>\, \Sigma_k </math> is, we will have <math>\, U=V</math>.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
which is the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
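<br />
The equality between the quadratic form and the squared Euclidean distance after the transformation can be checked numerically. This is a sketch with a made-up covariance matrix and made-up points.<br />
<br />
```matlab
Sigma = [4 1; 1 3];      % made-up covariance matrix (symmetric, positive definite)
mu    = [1; 2];
x     = [3; -1];

[U, S, V] = svd(Sigma);                  % Sigma = U*S*U' for such a matrix
T = diag(1 ./ sqrt(diag(S))) * U';       % T = S^(-1/2) * U'

x_star  = T * x;                         % transformed point
mu_star = T * mu;                        % transformed mean

d_quad = (x - mu)' * (Sigma \ (x - mu));             % (x-mu)' * inv(Sigma) * (x-mu)
d_eucl = (x_star - mu_star)' * (x_star - mu_star);   % squared distance after transform

fprintf('quadratic form: %.6f, squared Euclidean distance: %.6f\n', d_quad, d_eucl);
```
<br />
The two printed numbers agree up to rounding error, which is what allows Case 2 to be handled like Case 1 once the data have been transformed.<br />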
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape (that is, the same covariance matrix) for this method to be applicable, so this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we want to transform them to the same shape. To decide which class a given data point belongs to, which transformation should we apply? If we apply the transformation of class A, for example, then we have already assumed that the data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because it does not make LDA's assumption that the covariance matrix of every class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not be the case for real data. Another method, kernel QDA, does not rely on the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
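<br />
The two formulas can be tabulated for a few dimensions. This is a sketch; <code>lda_params</code> and <code>qda_params</code> are hypothetical helper names, and <code>d = 64</code> is the dimension of the 2_3 data before PCA.<br />
<br />
```matlab
lda_params = @(K, d) (K - 1) * (d + 1);                  % (K-1)(d+1)
qda_params = @(K, d) (K - 1) * (d * (d + 3) / 2 + 1);    % (K-1)(d(d+3)/2 + 1)

K = 2;
for d = [2 10 64]
    fprintf('d = %2d: LDA %4d, QDA %5d\n', d, lda_params(K, d), qda_params(K, d));
end
```
<br />
For two classes in 64 dimensions, LDA needs 65 parameters for the boundary while QDA needs 2145, which illustrates how quickly the quadratic count grows.<br />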
<br />
== Trick: Using LDA to do QDA ==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Note that this trick is not the same as QDA: applying QDA directly to the same problem generally gives a different decision boundary. What is described here is a way of obtaining a nonlinear boundary from LDA, and it could be called nonlinear LDA.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of the original features. We then do LDA on our new higher-dimensional data. The decision boundary found by LDA is linear in the augmented features, and when it is written in terms of the original features it becomes quadratic.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameters of that symmetric matrix make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>v</math> is taken to be a <math>d \times d</math> diagonal matrix with diagonal entries <math>v_1,\dots,v_d</math>, that we cannot estimate directly.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix containing the element-wise squares of the original entries, with cubic features by appending the element-wise cubes, or even with a different function altogether, such as <math>\,\sin(x)</math>.<br />
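<br />
The identity behind the trick, namely that a linear function of the augmented vector equals a quadratic function of the original vector, can be checked on a small example. The numbers are made up for illustration.<br />
<br />
```matlab
w = [1; -2];         % made-up linear coefficients
v = [0.5; 3];        % made-up diagonal quadratic coefficients v_1, ..., v_d
x = [2; -1];         % a point in the original d = 2 dimensions

g = x' * diag(v) * x + w' * x;    % quadratic function in the original space

w_star = [w; v];                  % augmented coefficient vector
x_star = [x; x.^2];               % augmented data vector [x_1, x_2, x_1^2, x_2^2]
g_star = w_star' * x_star;        % linear function in the augmented space

fprintf('g(x) = %g, g*(x, x^2) = %g\n', g, g_star);
```
<br />
Both expressions evaluate to the same number, so estimating the linear function in the augmented space is equivalent to estimating this restricted quadratic function in the original space.<br />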
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
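<br />
The same idea can be sketched on synthetic data without the Statistics Toolbox. This example does not use the 2_3 data: it assumes two made-up classes lying on concentric rings, and it replaces <code>classify</code> with a hand-rolled LDA with equal priors, so it only illustrates the mechanism.<br />
<br />
```matlab
% Synthetic data: class 1 on inner rings, class 2 on outer rings.
theta = (0:7)' * (2 * pi / 8);
ring  = @(r) [r * cos(theta), r * sin(theta)];
X     = [ring(0.5); ring(1.0); ring(2.5); ring(3.5)];
group = [ones(16, 1); 2 * ones(16, 1)];

X_star = [X, X.^2];                          % append the squared features

% Hand-rolled LDA with equal priors (a sketch, not the toolbox function).
mu1 = mean(X_star(group == 1, :), 1)';
mu2 = mean(X_star(group == 2, :), 1)';
Xc  = [X_star(group == 1, :) - repmat(mu1', 16, 1);
       X_star(group == 2, :) - repmat(mu2', 16, 1)];
Sigma = (Xc' * Xc) / size(X_star, 1);        % pooled ML covariance estimate

w    = Sigma \ (mu2 - mu1);                  % linear discriminant direction
c    = 0.5 * (mu1 + mu2)' * w;               % threshold at the midpoint of the means
pred = 1 + (X_star * w > c);                 % predicted label (1 or 2) for each row

fprintf('correctly classified: %d of %d\n', sum(pred == group), numel(group));
```
<br />
In the original two dimensions both class means sit at the origin, so a linear boundary cannot separate the rings; in the augmented space the boundary is linear in <math>x_1^2</math> and <math>x_2^2</math>, which corresponds to a circle in the original space.<br />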
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
 >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it correctly classifies only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function <code>princomp</code> in Matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 %   returns the principal components in PC, the so-called Z-scores in SCORES,<br />
 %   the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x);          % number of rows (m) and columns (n) of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using princomp and SVD respectively, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
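Then we can see that y=score and v=U (possibly up to the sign of individual columns, since singular vectors are only determined up to sign).<br />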
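<br />
For readers without the 2_3 data or the Statistics Toolbox, the steps that <code>princomp</code> performs can be sketched on a small made-up matrix with base Matlab functions only. The variable names follow the listing above; the data values are invented for illustration.<br />
<br />
```matlab
% Made-up data: 6 observations (rows) of 3 variables (columns).
X = [2 0 1; 3 1 0; 4 3 2; 5 2 4; 7 5 3; 9 4 6];
m = size(X, 1);

centerx = X - repmat(mean(X, 1), m, 1);        % subtract the column means
[U, D, pc] = svd(centerx / sqrt(m - 1), 0);    % economy-size SVD, as in the listing
score  = centerx * pc;                         % data in principal component space
latent = diag(D).^2;                           % variances along the components

% latent should match the eigenvalues of the sample covariance matrix of X.
ev = sort(eig(cov(X)), 'descend');
disp([latent, ev]);
```
<br />
The two displayed columns agree up to rounding error, and the columns of <code>score</code> are uncorrelated with variances given by <code>latent</code>.<br />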
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one-dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well-defined separation between the two classes, allowing us to classify the data. In practice, it is generally not possible to collapse all data points in one class to a single point. We instead make the data points within each class as close to each other as possible while keeping them as far as possible from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
 >> library(MASS)  # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
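: Centre X and calculate its singular value decomposition. The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA). Centring matters because the principal directions are defined relative to the mean of the data.<br />
 >> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />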
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This<br />
approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without their computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, and the individual classes may have different sizes. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, i.e. <math> \max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> \min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math>.<br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
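The two goals can be evaluated numerically for any candidate direction. The following Python sketch (standard library only) does so for the class means and covariance from the R example; the two candidate directions and the helper names <code>dot</code>, <code>quad_form</code> and <code>goals</code> are assumptions chosen purely for illustration.

```python
# Evaluate the two FDA goals for candidate directions w (2-D sketch).

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def quad_form(w, M):
    """Quadratic form w^T M w for a matrix given as a list of rows."""
    return dot(w, [dot(row, w) for row in M])

mu1, mu2 = [1.0, 1.0], [5.0, 3.0]      # class means from the R example
Sigma1 = [[1.0, 1.5], [1.5, 3.0]]      # class covariances from the R example
Sigma2 = [[1.0, 1.5], [1.5, 3.0]]

def goals(w):
    """Return (squared gap between projected means, summed projected variance)."""
    gap_sq = (dot(w, mu1) - dot(w, mu2)) ** 2
    within = quad_form(w, Sigma1) + quad_form(w, Sigma2)
    return gap_sq, within

# Two hypothetical unit directions, used only to show the trade-off.
for name, w in [("x1 axis", [1.0, 0.0]), ("x2 axis", [0.0, 1.0])]:
    gap_sq, within = goals(w)
    print(name, "gap^2 =", gap_sq, "within =", within)
```

Of these two candidates, projecting onto the x1 axis gives both a larger gap between the projected means and a smaller within-class variance; FDA searches over all directions rather than just two.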
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>z_{i} \in \mathbb{R}^{1}</math> with <math>z_{i} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math> which is scalar<br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
Our goal: <math>\max (\underline{w}^T S_{B} \underline{w})</math><br />
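As a quick numerical check of the identity derived above, the following Python sketch (standard library only) compares the squared distance between the projected means with the quadratic form in <math>\,S_{B}</math>. The direction <code>w</code> is a hypothetical choice and the helper names are illustrative.

```python
# Check that (w^T mu1 - w^T mu2)^2 equals w^T S_B w for a 2-D example.

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[x * y for y in b] for x in a]

def quad_form(w, M):
    """Quadratic form w^T M w."""
    return dot(w, [dot(row, w) for row in M])

mu1 = [1.0, 1.0]   # class means taken from the R example
mu2 = [5.0, 3.0]
w = [0.6, 0.8]     # hypothetical unit-length direction (assumption)

diff = [a - b for a, b in zip(mu1, mu2)]
S_B = outer(diff, diff)                    # between class covariance

lhs = (dot(w, mu1) - dot(w, mu2)) ** 2     # squared gap between projected means
rhs = quad_form(w, S_B)                    # w^T S_B w
print(lhs, rhs)                            # the two values agree up to rounding
```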
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for feature extraction that is often used as a preprocessing step for classification. It has applications in data visualization, data mining, reducing the dimensionality of a data set, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> obtained with the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
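The result can be cross-checked numerically. The Python sketch below (standard library only) maximises <math>\displaystyle f</math> over a fine grid on the unit circle and then verifies the gradient condition <math>\displaystyle \nabla f = \lambda \nabla g</math> at the stationary point, taking <math>\displaystyle g(x,y) = x^2+y^2-1</math> and <math>\displaystyle \lambda = \sqrt{2}/2</math>. The grid resolution is an arbitrary choice.

```python
import math

# Maximise f(x, y) = x - y on the unit circle by brute force, then compare
# with the stationary point obtained from the Lagrange multiplier method.

def f(x, y):
    return x - y

# Parametrise the constraint x^2 + y^2 = 1 as (cos t, sin t).
n_steps = 100000  # grid resolution (arbitrary choice)
best_t = max((2 * math.pi * k / n_steps for k in range(n_steps)),
             key=lambda t: f(math.cos(t), math.sin(t)))
x_best, y_best = math.cos(best_t), math.sin(best_t)

# Stationary point and multiplier from the worked example.
x_star, y_star = math.sqrt(2) / 2, -math.sqrt(2) / 2
lam = math.sqrt(2) / 2

# Gradient condition: grad f = lam * grad g, with g = x^2 + y^2 - 1.
grad_f = (1.0, -1.0)
grad_g = (2 * x_star, 2 * y_star)
residual = max(abs(grad_f[i] - lam * grad_g[i]) for i in range(2))

print(x_best, y_best, residual)
```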
<br />
====Determining '''W''' ====<br />
Returning to the original problem, we form the Lagrangian (writing <math>S</math> for the sample covariance matrix <math>s</math> above),<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second part of the equation is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, then the second part of the equation is nonzero and the constraint is violated. Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
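These properties can be checked on a small example. The Python sketch below (standard library only) uses the closed-form eigen-decomposition of a symmetric 2 by 2 matrix, with the covariance matrix from the R example standing in for <math>S</math>. It verifies that the variance along the top eigenvector equals the largest eigenvalue and that the eigenvalues sum to the trace. The variable names are illustrative.

```python
import math

# Eigen-decomposition of a symmetric 2x2 covariance matrix S = [[a, b], [b, c]].
S = [[1.0, 1.5], [1.5, 3.0]]   # covariance from the R example (illustrative)
a, b, c = S[0][0], S[0][1], S[1][1]

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (a + c) / 2
rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + rad, half_trace - rad   # lam1 >= lam2

# Unit eigenvector for lam1 (valid when b != 0): solves (S - lam1 I) w = 0.
norm = math.hypot(b, lam1 - a)
w1 = [b / norm, (lam1 - a) / norm]

def quad_form(w, M):
    """Quadratic form w^T M w for a 2x2 matrix."""
    return sum(w[i] * M[i][j] * w[j] for i in range(2) for j in range(2))

var_pc1 = quad_form(w1, S)     # variance along the first principal component
print(var_pc1, lam1, lam1 + lam2, a + c)
```

Any other unit direction, for example a coordinate axis, gives a variance no larger than <code>lam1</code>.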
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
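Since the slide is not reproduced in the text, the following Python sketch (standard library only) illustrates the usual steps rather than transcribing the slide: centre the data, compute the covariance, take the top eigenvector, encode, and reconstruct. It assumes made-up data with D = 2 reduced to d = 1, and it divides by n in the covariance for simplicity.

```python
import math

# Toy data: n = 4 points in D = 2 dimensions (made-up values), reduced to d = 1.
X = [(2.0, 1.0), (3.0, 4.0), (5.0, 5.0), (6.0, 8.0)]
n = len(X)

# Step 1: centre the data.
mean = [sum(p[j] for p in X) / n for j in range(2)]
Xc = [[p[0] - mean[0], p[1] - mean[1]] for p in X]

# Step 2: sample covariance S = [[a, b], [b, c]] (dividing by n for simplicity).
a = sum(p[0] * p[0] for p in Xc) / n
b = sum(p[0] * p[1] for p in Xc) / n
c = sum(p[1] * p[1] for p in Xc) / n

# Step 3: eigenvalues and top eigenvector of the symmetric 2x2 matrix (needs b != 0).
half_trace = (a + c) / 2
rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + rad, half_trace - rad
norm = math.hypot(b, lam1 - a)
u = [b / norm, (lam1 - a) / norm]

# Step 4: encode each centred point as one coefficient y = u^T x.
Y = [u[0] * p[0] + u[1] * p[1] for p in Xc]

# Step 5: reconstruct x_hat = u * y, then add the mean back.
X_hat = [[u[0] * y + mean[0], u[1] * y + mean[1]] for y in Y]

# Mean squared reconstruction error; it equals the discarded eigenvalue lam2.
mse = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(X, X_hat)) / n
print(mse, lam2)
```

The reconstruction error equals the variance in the discarded direction, which is why dropping low-variance components loses little information.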
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean(see details[http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia].)<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6428stat841f102010-10-02T21:41:20Z<p>Hclam: /* Two-class Problem */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the <math>\,i</math>th training input's feature values <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math> form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function, i.e., <math>\,I(h(X_i) \neq Y_i) = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,i</math>-th training input, respectively.<br />
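<br />
The empirical error rate is simply the fraction of training pairs that the rule gets wrong. The following is a minimal Python sketch of that definition; the toy data and the threshold rule <code>toy_rule</code> are hypothetical and serve only as an illustration.<br />

```python
def empirical_error_rate(h, inputs, labels):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(inputs) != len(labels) or not inputs:
        raise ValueError("inputs and labels must be non-empty and of equal length")
    # Each misclassified pair contributes 1 to the sum (the indicator I).
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(inputs)


# Hypothetical toy data: one feature (diameter in cm), labels 0 = orange, 1 = apple.
toy_inputs = [6.0, 6.5, 7.0, 8.0, 8.5, 9.0]
toy_labels = [0, 0, 1, 1, 1, 0]


def toy_rule(x):
    # Hypothetical threshold rule, not a trained classifier.
    return 1 if x > 6.8 else 0


error = empirical_error_rate(toy_rule, toy_inputs, toy_labels)
print(error)  # one of the six inputs is misclassified
```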
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P(Y=1|X=x)</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimate of <math>\,r(x)</math> produced by the Bayes classifier's trained model at <math>\,x</math>. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h</math>, and the Bayes classifier assigns any new data input the label <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the components of the Bayes classifier's model, such as the likelihoods and the prior probabilities, must either be estimated from training data or be guessed with a certain degree of belief. As a result, their values in the trained model may deviate considerably from the true population values, which in turn can cause the estimated posterior probabilities to deviate considerably from the true ones. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> \,r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
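<br />
Approach 3 can be sketched in a few lines of Python. The sketch below assumes a single real-valued feature and, purely for illustration, models each class-conditional density as a univariate Gaussian fitted with the standard library's <code>statistics.NormalDist</code>; the function name <code>fit_density_classifier</code> and the toy data are hypothetical.<br />

```python
from statistics import NormalDist


def fit_density_classifier(xs, ys):
    """Two-class density-estimation classifier for one real-valued feature.

    Assumption (not from the notes): each class density is modelled as a
    univariate Gaussian estimated from that class's training inputs.
    """
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    f0 = NormalDist.from_samples(class0)  # estimate of P(X=x | Y=0)
    f1 = NormalDist.from_samples(class1)  # estimate of P(X=x | Y=1)
    pi1 = sum(ys) / len(ys)               # estimate of P(Y=1)

    def r_hat(x):
        # Estimated posterior P(Y=1 | X=x) via the Bayes formula.
        numerator = f1.pdf(x) * pi1
        return numerator / (numerator + f0.pdf(x) * (1 - pi1))

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h


# Hypothetical training data.
xs = [1.0, 1.5, 2.0, 2.5, 5.0, 5.5, 6.0, 6.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
r_hat, h = fit_density_classifier(xs, ys)
print(h(2.0), h(6.0))
```

Any other density estimator (for example a histogram or a kernel density estimate) could be substituted for the Gaussian fit without changing the rest of the rule.<br />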
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
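<br />
As an illustration of the theorem, the following Python sketch computes the posterior of every class from given likelihood values <math>\,f_i(x)</math> and priors <math>\,\pi_i</math>, and then returns the most probable class. The function names and the numbers are hypothetical and not taken from the lecture.<br />

```python
def posteriors(likelihoods, priors):
    """Return P(Y=i | X=x) for every class i, given f_i(x) and pi_i."""
    # The evidence P(X=x) is the sum of f_i(x) * pi_i over all classes.
    evidence = sum(likelihoods[i] * priors[i] for i in priors)
    if evidence == 0:
        raise ValueError("x has zero probability under every class")
    return {i: likelihoods[i] * priors[i] / evidence for i in priors}


def bayes_rule(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Hypothetical values for k = 3 classes at one fixed input x.
priors = {1: 0.5, 2: 0.3, 3: 0.2}
likelihoods = {1: 0.10, 2: 0.40, 3: 0.30}

post = posteriors(likelihoods, priors)
label = bayes_rule(likelihoods, priors)
print(post, label)
```

Since the evidence is the same for every class, the arg max could equally be taken over the products <math>\,f_i(x)\pi_i</math> without normalizing.<br />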
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his or her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
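<br />
The arithmetic of this example can be checked with a few lines of Python. The two likelihood values 0.05 and 0.2 are the ones appearing in the calculation above; the variable names are only illustrative.<br />

```python
# Likelihoods of x = (0, 1, 0) under each class, as used in the calculation above.
p_x_given_pass = 0.05   # P(X = (0,1,0) | Y = 1)
p_x_given_fail = 0.20   # P(X = (0,1,0) | Y = 0)
prior_pass = prior_fail = 0.5

# Estimated posterior probability of passing, by the Bayes formula.
evidence = p_x_given_fail * prior_fail + p_x_given_pass * prior_pass
r_hat = p_x_given_pass * prior_pass / evidence

prediction = 1 if r_hat > 0.5 else 0
print(r_hat, prediction)
```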
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable it is that event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on an event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] in the linear case) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior probabilities and the class-conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where the vector ''a'' and the scalar ''b'' are constants. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
<br />
Not surprisingly, the Bayes classifier's decision boundary being linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
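<br />
The coefficients of this hyperplane are easy to compute numerically. The following NumPy sketch evaluates ''a'' and ''b'' as defined above; the means, covariance matrix and priors are hypothetical values chosen for illustration. Tracing the signs through the derivation, an input is assigned to class 1 when the left-hand side of the boundary equation is negative.<br />

```python
import numpy as np


def lda_boundary(mu0, mu1, sigma, pi0, pi1):
    """Coefficients (a, b) of the two-class LDA boundary a^T x + b = 0."""
    sigma_inv = np.linalg.inv(sigma)
    a = -2.0 * sigma_inv @ (mu1 - mu0)
    b = (-2.0 * np.log(pi1 / pi0)
         + mu1 @ sigma_inv @ mu1
         - mu0 @ sigma_inv @ mu0)
    return a, b


def lda_classify(x, a, b):
    # After multiplying the derivation by -2, class 1 is the negative side.
    return 1 if a @ x + b < 0 else 0


# Hypothetical parameters for a two-dimensional problem.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
a, b = lda_boundary(mu0, mu1, sigma, pi0=0.5, pi1=0.5)

# With equal priors the boundary passes through the midpoint of the means.
midpoint = (mu0 + mu1) / 2
print(a @ midpoint + b)
```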
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. One then obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. If the two classes <math>\,m </math> and <math>\,n</math> have equal prior probabilities (for example, when the priors are estimated from equal numbers of training points), then the log term vanishes and the resulting decision boundary passes through the point exactly halfway between the class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \ \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic discriminant analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance matrices of the classes are identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier's rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
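<br />
The two discriminant functions of the theorem can be sketched in NumPy as follows. The function names and the parameter values are hypothetical, and classes are indexed from 0 in the code rather than from 1. When all classes share one covariance matrix, the quadratic and linear forms differ only by a term that does not depend on <math>\,k</math>, so they select the same class.<br />

```python
import numpy as np


def delta_quadratic(x, mu, sigma, prior):
    """QDA discriminant: -1/2 log|S| - 1/2 (x-mu)^T S^{-1} (x-mu) + log(pi)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(prior)


def delta_linear(x, mu, sigma, prior):
    """LDA discriminant: x^T S^{-1} mu - 1/2 mu^T S^{-1} mu + log(pi)."""
    w = np.linalg.solve(sigma, mu)  # S^{-1} mu
    return x @ w - 0.5 * mu @ w + np.log(prior)


def classify(x, mus, sigmas, priors, delta):
    """Return the (0-based) class index with the largest discriminant value."""
    scores = [delta(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))


# Hypothetical three-class problem with a shared covariance matrix.
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
shared = np.array([[1.0, 0.2],
                   [0.2, 2.0]])
sigmas = [shared, shared, shared]
priors = [0.5, 0.3, 0.2]

x = np.array([2.5, 0.5])
label_q = classify(x, mus, sigmas, priors, delta_quadratic)
label_l = classify(x, mus, sigmas, priors, delta_linear)
print(label_q, label_l)
```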
<br />
===In practice===<br />
In practice, the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we assume a common covariance matrix, it is estimated by the average of the class sample covariances weighted by class size, which is the maximum likelihood (ML) estimate:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma_r}}{\sum_{l=1}^{k}n_l} </math><br />
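<br />
These sample estimates can be sketched in NumPy as follows. The function name <code>estimate_parameters</code> and the toy data are hypothetical; the class covariances use the divisor <math>\,n_k</math> as in the formulas above, and the common covariance is their average weighted by class size.<br />

```python
import numpy as np


def estimate_parameters(X, y):
    """Sample estimates of the priors, class means, class covariances and
    the pooled (common) covariance, for data X (n x d) and labels y (n)."""
    n, d = X.shape
    priors, means, covs = {}, {}, {}
    pooled = np.zeros((d, d))
    for k in np.unique(y).tolist():
        Xk = X[y == k]                        # training inputs of class k
        n_k = len(Xk)
        priors[k] = n_k / n                   # pi_k estimate
        means[k] = Xk.mean(axis=0)            # mu_k estimate
        centered = Xk - means[k]
        covs[k] = centered.T @ centered / n_k  # Sigma_k estimate (divisor n_k)
        pooled += n_k * covs[k]
    pooled /= n                               # weighted average of the covs
    return priors, means, covs, pooled


# Hypothetical toy data: four points, two classes.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 9.0]])
y = np.array([0, 0, 1, 1])
priors, means, covs, pooled = estimate_parameters(X, y)
print(priors, means, pooled)
```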
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data of each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class with the smallest prior-adjusted distance maximises <math>\,\delta_k</math>; when the priors are equal, this is simply the class whose center is nearest. According to the theorem, we can then classify the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, as a covariance matrix <math>\, \Sigma_k </math> is, we can take <math>\, U=V</math>.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
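<br />
As a numerical check of the derivation, the following Matlab sketch compares the original quadratic form with the squared Euclidean distance after the transformation. The covariance <code>Sigma</code>, the mean <code>mu</code> and the point <code>x</code> are assumed example values, and <code>eig</code> is used to obtain <math>\,U</math> and <math>\,S</math> since the matrix is symmetric.<br />
<br />
>> Sigma = [2 1; 1 3]; % assumed covariance (symmetric, positive definite)<br />
>> mu = [1; 2]; x = [3; 0]; % assumed class mean and data point<br />
>> [U, S] = eig(Sigma); % Sigma = U*S*U' with orthonormal columns in U<br />
>> T = diag(1./sqrt(diag(S))) * U'; % the transformation S^(-1/2)*U'<br />
>> d_quad = (x-mu)' / Sigma * (x-mu); % (x-mu)' * inv(Sigma) * (x-mu)<br />
>> d_eucl = norm(T*x - T*mu)^2; % squared Euclidean distance between x* and mu*<br />
>> abs(d_quad - d_eucl) < 1e-10 % the two quantities agree up to rounding error<br />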
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape (the same covariance matrix) for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider two classes with different shapes, and consider transforming them to the same shape. Given a data point, we must decide which class it belongs to. The question is, which transformation should we use? For example, if we use the transformation of class A, then we have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA can often fit the data better than LDA because QDA does not make LDA's assumption that the covariance matrix is identical for every class. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
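<br />
As a worked example of these two formulas, take <math>\,K=2</math> classes. With the original <math>\,d=64</math> dimensions of the 2_3 data we get<br />
<br />
<math>\,\textrm{LDA}: (2-1)(64+1) = 65, \qquad \textrm{QDA}: (2-1)\left(\frac{64 \times 67}{2}+1\right) = 2145</math><br />
<br />
whereas after reducing to <math>\,d=2</math> dimensions we get<br />
<br />
<math>\,\textrm{LDA}: (2-1)(2+1) = 3, \qquad \textrm{QDA}: (2-1)\left(\frac{2 \times 5}{2}+1\right) = 6</math><br />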
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of our original features. We then do LDA on our new higher-dimensional data. The linear boundary found by LDA in the higher-dimensional space can then be expressed in terms of the original features, giving us a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the number of distinct entries of a symmetric <math>\,d \times d</math> matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
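where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. (In the construction that follows, <math>\,v</math> is taken to be a diagonal matrix with entries <math>\,v_1,\ldots,v_d</math>, so there are no cross terms.)<br />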
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
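<br />
:The same augmentation can be written without loops. The sketch below uses a small made-up matrix <code>A</code> (a hypothetical stand-in for the 400-by-2 <code>sample</code>) so that the result is easy to inspect.<br />
<br />
>> A = [1 2; 3 4; 5 6]; % hypothetical 3-by-2 data matrix<br />
>> A_star = [A, A.^2]; % append the element-wise squares as two new columns<br />
>> % A_star is now [1 2 1 4; 3 4 9 16; 5 6 25 36]<br />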
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
{{Cleanup|date=September 2010|reason=Perhaps the first line should be plot (sample(1:200,1), sample(1:200,2), 'b.');?}} <br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
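<br />
:Written out, the empirical error rate is the fraction of misclassified points:<br />
<br />
:<math>\,\frac{400-369}{400} = \frac{31}{400} = 0.0775</math><br />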
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it correctly classifies only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learnt how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which can perform PCA conveniently. The Matlab help file on <code>princomp</code> gives the details about this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp with explanations of some emphasized steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, columns to variables. When using princomp on 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using the SVD method and princomp respectively, with the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (possibly up to the sign of each column).<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points in each class close to each other while keeping them far from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
>> library(MASS) # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
: Calculate the singular value decomposition of the centered X (PCA requires the column means to be subtracted first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA)<br />
>> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper, "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA] ", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without their computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and the Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
In the two-class problem, we have prior knowledge that the data points belong to two classes. Intuitively speaking, points from each class form a cloud around the mean of the class, with each class possibly having a different size. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{1}</math> the covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> represents the mean and <math>\displaystyle\Sigma_{2}</math> the covariance of the 2nd class.<br />
<br />
We have to find a transformation which satisfies the following goals:<br />
<br />
1. ''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between projected means, i.e. <math> max((\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}}))</math>. The steps of this maximization are given below. <br />
<br />
2. ''To collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances, i.e. <math> min(\underline{w}^T\Sigma_{1}\underline{w} + \underline{w}^T\Sigma_{2}\underline{w})</math><br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /> <math>\ \{ \underline x_1, \underline x_2, \cdots, \underline x_n \} </math><br />
<br />
<br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math> <br />
<br />
Note that <math>\ z_i </math> is scalar.<br />
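<br />
The two goals can be compared numerically. The Matlab sketch below reuses the means and covariance from the R example above as assumed inputs and evaluates the ratio of goal 1 to goal 2 for two candidate directions. The direction <code>w_fda</code> uses the standard closed-form Fisher direction, which is stated here as a known result and is not derived in this section.<br />
<br />
>> mu1 = [1; 1]; mu2 = [5; 3]; % assumed class means<br />
>> Sigma1 = [1 1.5; 1.5 3]; Sigma2 = Sigma1; % assumed class covariances (equal here)<br />
>> J = @(w) (w'*mu1 - w'*mu2)^2 / (w'*Sigma1*w + w'*Sigma2*w); % goal 1 divided by goal 2<br />
>> w_means = (mu1 - mu2) / norm(mu1 - mu2); % direction joining the two means<br />
>> w_fda = (Sigma1 + Sigma2) \ (mu1 - mu2); % standard Fisher direction<br />
>> w_fda = w_fda / norm(w_fda);<br />
>> [J(w_means), J(w_fda)] % the second value is at least as large as the first<br />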
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d'' - dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k'' - dimensional space, which is easier to analyze and plot.<br />
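<br />
A minimal Matlab sketch of this idea follows. The data matrix <code>X</code> is randomly generated for illustration (an assumption, not course data), with rows as observations, and the principal components are taken as the eigenvectors of the sample covariance matrix.<br />
<br />
>> X = randn(100,3) * [2 0 0; 1 1 0; 0 0 0.1]; % hypothetical sample: 100 observations, d = 3<br />
>> Xc = X - repmat(mean(X), 100, 1); % center each column<br />
>> [V, D] = eig(cov(Xc)); % eigenvectors and eigenvalues of the sample covariance<br />
>> [lambda, idx] = sort(diag(D), 'descend'); % order the directions by decreasing variance<br />
>> V = V(:, idx);<br />
>> k = 2;<br />
>> Z = Xc * V(:, 1:k); % the data approximated in k dimensions<br />
>> sum(lambda(1:k)) / sum(lambda) % fraction of the total variance that is preserved<br />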
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
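<br />
The factorization can be sketched in Matlab as follows. The matrix <code>X</code> here is random and only stands in for the 644 by 103 image matrix (an assumption for illustration); <code>V</code> and <code>W</code> play the roles described above.<br />
<br />
>> X = rand(644, 103); % hypothetical stand-in for the image matrix<br />
>> xbar = mean(X, 2); % the mean image<br />
>> Xc = X - repmat(xbar, 1, 103); % subtract the mean image from every column<br />
>> [U, S, Q] = svd(Xc, 0); % economy size decomposition<br />
>> V = U(:, 1:2); % V is 644 by 2: the first two principal directions<br />
>> W = V' * Xc; % W is 2 by 103: each column holds lambda for one image<br />
>> f_hat = xbar + V * W(:, 1); % approximation of the first image from 2 coefficients<br />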
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed with the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is a maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value: <math>\displaystyle f(\sqrt{2}/2,-\sqrt{2}/2)=\sqrt{2}</math>, while <math>\displaystyle f(-\sqrt{2}/2,\sqrt{2}/2)=-\sqrt{2}</math>. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
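<br />
The example can also be checked numerically. The following is a minimal Python sketch (the function names <code>f</code> and <code>grad_lagrangian</code> are made up for this illustration) that evaluates the partial derivatives of the Lagrangian <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> at the two candidate points and compares the values of <math>\displaystyle f</math>.<br />

```python
import math

def f(x, y):
    """Objective from the example: f(x, y) = x - y."""
    return x - y

def grad_lagrangian(x, y, lam):
    """Partial derivatives of L(x, y, lam) = x - y - lam * (x**2 + y**2 - 1)."""
    dL_dx = 1 - 2 * lam * x
    dL_dy = -1 - 2 * lam * y
    dL_dlam = -(x ** 2 + y ** 2 - 1)
    return dL_dx, dL_dy, dL_dlam

r = math.sqrt(2) / 2
# Candidate stationary points (x, y, lambda); lambda follows from 1 - 2*lam*x = 0.
candidates = [(r, -r, 1 / (2 * r)), (-r, r, -1 / (2 * r))]

for x, y, lam in candidates:
    # All three partial derivatives should vanish (up to rounding error).
    print((x, y), "gradient:", grad_lagrangian(x, y, lam), "f:", f(x, y))

# The constrained maximum is the candidate with the larger value of f.
best = max(candidates, key=lambda p: f(p[0], p[1]))
print("maximum at", best[:2])
```

Both candidates make the gradient of the Lagrangian vanish, so comparing the values of <math>\displaystyle f</math> is what separates the maximum from the minimum.<br />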
<br />
====Determining '''W''' ====<br />
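Back to the original problem, writing <math>\displaystyle S</math> for the sample covariance matrix (denoted <math>\displaystyle s</math> above), we form the Lagrangian,<br />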
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w'''; since '''S''' is symmetric, its derivative is '''2Sw'''. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. The first PC, '''u<sub>1</sub>''', therefore has the maximum variance<br /> (i.e. it captures as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components take up successively smaller parts of the total variability.<br />
<br />
<br />
<math>\, D</math>-dimensional data will have <math>\, D</math> eigenvectors, with eigenvalues ordered as<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
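<br />
These facts can be illustrated numerically. The following is a minimal Python sketch on a small made-up two-dimensional data set (the data and all variable names are assumptions for illustration only). It uses the closed-form eigenvalues of a symmetric 2-by-2 matrix, then checks that the variance of the projections onto the first eigenvector equals the largest eigenvalue and that the eigenvalues sum to the trace of the sample covariance matrix.<br />

```python
import math

# Made-up 2-D data set (each tuple is one observation).
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(data)

# Center the data and form the sample covariance matrix S = [[a, b], [b, c]].
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
a = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
c = sum((y - my) ** 2 for _, y in data) / (n - 1)
b = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (a + c) / 2
root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = half_trace + root, half_trace - root

# Unit eigenvector for lam1: (b, lam1 - a) solves (S - lam1*I) w = 0 when b != 0.
norm = math.hypot(b, lam1 - a)
w1 = (b / norm, (lam1 - a) / norm)

# Sample variance of the projections u_i = w1^T (x_i - mean).
u = [w1[0] * (x - mx) + w1[1] * (y - my) for x, y in data]
var_u = sum(v ** 2 for v in u) / (n - 1)

print("eigenvalues:", lam1, lam2)
print("variance along first PC:", var_u)
print("trace of S:", a + c)
```

For data with more than two dimensions the closed form no longer applies and a numerical eigendecomposition would be used instead.<br />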
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                % load the data matrix X (one image per column)<br />
 who                                       % list the variables that were loaded<br />
 size(X)                                   % number of pixels per image by number of images<br />
 imagesc(reshape(X(:,1),20,28)')           % display the first noisy image<br />
 colormap gray<br />
 [u, s, v] = svd(X);                       % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)')     % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. do not label the data and compare the output groups).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
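The encoding and reconstruction steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not a reproduction of the slide: it keeps a single component (<math>\, d = 1</math>), uses power iteration as a simple stand-in for a full eigendecomposition, and runs on made-up three-dimensional data. The helper names <code>center</code>, <code>covariance</code> and <code>top_eigenvector</code> are hypothetical.<br />

```python
import math

def center(X):
    """Subtract the mean of each feature; X is a list of n observations of length D."""
    n, D = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(D)]
    return [[row[j] - mean[j] for j in range(D)] for row in X], mean

def covariance(Xc):
    """Sample covariance matrix of centered data."""
    n, D = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(D)]
            for i in range(D)]

def top_eigenvector(S, iters=1000):
    """Power iteration: repeatedly apply S and renormalize."""
    D = len(S)
    w = [1.0] * D
    for _ in range(iters):
        w = [sum(S[i][j] * w[j] for j in range(D)) for i in range(D)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

# Made-up 3-D data lying close to a line, so one component captures most variance.
X = [[1.0, 2.0, 3.1], [2.0, 4.1, 6.0], [3.0, 5.9, 9.0],
     [4.0, 8.0, 12.1], [5.0, 10.0, 14.9]]
Xc, mean = center(X)
u1 = top_eigenvector(covariance(Xc))

# Encode: y_i = u1^T x_i.  Reconstruct: x_hat_i = u1 * y_i, then add the mean back.
Y = [sum(u * x for u, x in zip(u1, row)) for row in Xc]
X_hat = [[u * y + m for u, m in zip(u1, mean)] for y in Y]

max_err = max(abs(p - q) for r, rh in zip(X, X_hat) for p, q in zip(r, rh))
print("codes:", Y)
print("largest reconstruction error:", max_err)
```

Because the made-up data lie almost on a line, one coefficient per observation reconstructs each point closely; the small remaining error is the variance in the discarded directions.<br />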
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>, where the columns of <math>\, U</math> are the top <math>\, d</math> eigenvectors. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6427stat841f102010-10-02T21:28:15Z<p>Hclam: /* Distance Metric Learning VS FDA */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification - 2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views. Some people call supervised learning "classification" and unsupervised learning "clustering", but others call clustering "unsupervised classification", e.g. http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups that can be found by a Google search. I think it would be good to make this clear here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> Y </math> from another random variable <math> X </math>, where <math> Y </math> represents the label assigned to a new data input and <math> X </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input, <math>\,Y_{i} \in \mathcal{Y} </math>, takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
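<br />
The empirical error rate is straightforward to compute. The following is a minimal Python sketch; the rule <code>h</code>, its threshold of 0.5 and the five training pairs are all made up for illustration.<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)

# Hypothetical one-feature rule: predict class 1 when the feature exceeds 0.5.
def h(x):
    return 1 if x > 0.5 else 0

# Made-up training set of n = 5 inputs with their true labels.
X_train = [0.1, 0.4, 0.6, 0.8, 0.9]
Y_train = [0, 1, 1, 1, 0]

# The second and fifth inputs are misclassified, so the rate is 2/5.
print(empirical_error_rate(h, X_train, Y_train))
```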
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics). Its widely used simplified form, the [http://en.wikipedia.org/wiki/Naive_Bayes_classifier naive] Bayes classifier, adds strong independence assumptions; a more descriptive term for the underlying probability model of the naive version would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for naive Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input. In practice <math>\,r(x)</math> is unknown, so the classifier uses <math>\hat r(x)</math>, the estimated value of <math>\,r(x)</math> given by its model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
A rather detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that the various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
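<br />
Approach 3 can be sketched in Python for the simplest possible setting, a single discrete feature. This is an illustration under stated assumptions rather than a general implementation: the class-conditional probabilities are estimated by relative frequencies, the function name <code>fit_plugin_classifier</code> is hypothetical, and the training data are made up.<br />

```python
from collections import Counter

def fit_plugin_classifier(X, Y):
    """Estimate P(X=x|Y=y) by relative frequencies within each class and
    P(Y=1) by the mean of the labels, then plug both into the Bayes formula."""
    n = len(Y)
    prior1 = sum(Y) / n
    counts = {0: Counter(), 1: Counter()}
    for x, y in zip(X, Y):
        counts[y][x] += 1
    totals = {y: sum(c.values()) for y, c in counts.items()}

    def r_hat(x):
        # Estimated posterior P(Y=1 | X=x); undefined for a value of x
        # that never appears in the training data.
        num = counts[1][x] / totals[1] * prior1
        den = num + counts[0][x] / totals[0] * (1 - prior1)
        return num / den

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Made-up training data: a single binary feature and a binary label.
X_train = [1, 1, 1, 0, 0, 1, 0, 0]
Y_train = [1, 1, 1, 1, 0, 0, 0, 0]
r_hat, h = fit_plugin_classifier(X_train, Y_train)
print(r_hat(1), h(1))
print(r_hat(0), h(0))
```

With continuous features the relative frequencies would be replaced by a density estimate for each class, but the plug-in step stays the same.<br />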
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be reworded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \underset{i}{\operatorname{arg\,max}} \, P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
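<br />
The multi-class rule can be sketched in Python as follows. The likelihood values <math>\,f_i(x)</math> and the priors <math>\,\pi_i</math> below are hypothetical numbers chosen for illustration, and the function names are made up; in practice both quantities would be estimated from training data.<br />

```python
def posteriors(likelihoods, priors):
    """Posterior P(Y=i | X=x) for each class i, given f_i(x) and pi_i."""
    joint = {i: likelihoods[i] * priors[i] for i in priors}
    evidence = sum(joint.values())
    return {i: joint[i] / evidence for i in joint}

def bayes_rule(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)

# Hypothetical values of f_i(x) at one fixed x and of pi_i for k = 3 classes.
f_x = {1: 0.10, 2: 0.30, 3: 0.20}
pi = {1: 0.5, 2: 0.3, 3: 0.2}

post = posteriors(f_x, pi)
print(post)
print(bayes_rule(f_x, pi))
```

Since the evidence is the same for every class, the same label is obtained by maximizing <math>\,f_i(x)\pi_i</math> directly.<br />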
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his/her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.05 \times 0.5+0.2 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
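The same calculation can be sketched in a few lines of Python. This is only an illustration: the function names <code>posterior_pass</code> and <code>bayes_classify</code> are made up for this example, and the likelihoods 0.05 and 0.2 are the values used in the calculation above.<br />

```python
def posterior_pass(lik_pass, lik_fail, prior_pass=0.5):
    """Return P(Y=1 | X=x) via Bayes' theorem for a two-class problem."""
    prior_fail = 1.0 - prior_pass
    numerator = lik_pass * prior_pass
    evidence = numerator + lik_fail * prior_fail  # P(X=x)
    return numerator / evidence


def bayes_classify(lik_pass, lik_fail, prior_pass=0.5):
    """Return 1 (pass) if the posterior exceeds 1/2, otherwise 0 (fail)."""
    return 1 if posterior_pass(lik_pass, lik_fail, prior_pass) > 0.5 else 0


# Likelihoods of x = (0, 1, 0) under each class, as in the worked example.
r_hat = posterior_pass(lik_pass=0.05, lik_fail=0.2)
label = bayes_classify(lik_pass=0.05, lik_fail=0.2)
print(round(r_hat, 4), label)
```

Changing <code>prior_pass</code> shows how strongly the prediction depends on the guessed prior probabilities.<br />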
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how strongly one believes that E will occur before knowing anything about any other event whose occurrence could affect E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability of E's occurrence. In practice, however, E's prior probability is unknown, and therefore it must either be guessed or be estimated from a sample of available data. After obtaining a guessed or estimated value of E's prior probability, the Bayesian view holds that the probability, that is, the believability, of E's occurrence can always be made more accurate should any new information regarding events relevant to E become available. The more useful information of this kind is available, the more accurate the estimate of the probability of E's occurrence becomes. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work laying the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
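The long-run approximation <math>P(x) \approx \frac{n_x}{n_t}</math> can be illustrated with a small simulation. The sketch below is not part of the lecture: the "intrinsic" probability 0.3, the seed and the function name <code>relative_frequency</code> are arbitrary choices made for this example.<br />

```python
import random


def relative_frequency(p_true, n_trials, seed=0):
    """Simulate n_trials independent trials and return n_x / n_t."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_x / n_trials


# The relative frequency settles near the assumed probability as n_t grows.
for n_t in (10, 1000, 100000):
    print(n_t, relative_frequency(0.3, n_t))
```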
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which is a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math>, where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
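As a numerical check of the boundary derived above, the following sketch computes ''a'' and ''b'' for a two-dimensional problem using only the Python standard library. The means, covariance matrix and priors are invented for illustration, and the helper names (<code>inv2</code>, <code>lda_boundary</code>, <code>side</code>) are hypothetical.<br />

```python
import math


def inv2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the decision boundary is a^T x + b = 0."""
    s_inv = inv2(sigma)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    a = [-2.0 * c for c in matvec(s_inv, diff)]
    b = (-2.0 * math.log(pi1 / pi0)
         + dot(mu1, matvec(s_inv, mu1))
         - dot(mu0, matvec(s_inv, mu0)))
    return a, b


def side(a, b, x):
    """Evaluate a^T x + b; with this sign convention, negative favours class 1."""
    return dot(a, x) + b


# Illustrative (made-up) parameters with equal priors.
mu1, mu0 = [2.0, 0.0], [0.0, 1.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
a, b = lda_boundary(mu1, mu0, sigma, pi1=0.5, pi0=0.5)
midpoint = [(mu1[0] + mu0[0]) / 2, (mu1[1] + mu0[1]) / 2]
print(side(a, b, midpoint), side(a, b, mu1), side(a, b, mu0))
```

With equal priors, the midpoint of the two means evaluates to (numerically) zero, so it lies on the boundary, while the two means fall on opposite sides.<br />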
<br />
<br />
LDA under the two-class case can easily be generalized to the case where there are <math>\,k \ge 2</math> classes. To find the Bayes classifier's decision boundary between any two classes <math>\,m </math> and <math>\,n</math>, we follow the same derivation as the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. This gives the decision boundary between classes <math>\,m </math> and <math>\,n</math> as <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. If, in addition, the two classes have equal prior probabilities (for example, when the priors are estimated from the class counts and both classes contain the same number of data points), then the log term vanishes and the decision boundary passes through the point exactly halfway between the class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that calculating the class posteriors <math>\Pr(Y=k|X=x)</math> gives us optimal classification. He also shows that, by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume a common covariance matrix for the classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the Gaussians share a common covariance matrix <math>\,\Sigma</math>, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
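<br />
To see how the linear form follows from the quadratic one, here is a short derivation added as a sketch that uses only the symbols of the theorem. Setting <math>\,\Sigma_k = \Sigma</math> and expanding the quadratic term gives<br />
:<math>\,\delta_k(x) = -\frac{1}{2}\log(|\Sigma|) - \frac{1}{2}x^\top\Sigma^{-1}x + x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log(\pi_k)</math><br />
The terms <math>\,-\frac{1}{2}\log(|\Sigma|)</math> and <math>\,-\frac{1}{2}x^\top\Sigma^{-1}x</math> are the same for every class, so they do not change which <math>\,k</math> attains the maximum and can be dropped. What remains is linear in <math>\,x</math>.<br />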
<br />
===In practice===<br />
The true parameters are unknown in practice, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When a common covariance matrix is assumed, it is estimated by the weighted average of the class sample covariances, which is the maximum likelihood estimate:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}(n_r\hat{\Sigma}_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
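<br />
The estimates above can be sketched in plain Python as follows. The tiny data set is invented for illustration, and the function names <code>class_estimates</code> and <code>pooled_covariance</code> are hypothetical. The covariances use the <math>\,\frac{1}{n_k}</math> normalisation given above.<br />

```python
def class_estimates(xs, ys):
    """Return, per class, the count, prior, mean and ML covariance."""
    n = len(xs)
    d = len(xs[0])
    stats = {}
    for k in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(members)
        mu = [sum(x[j] for x in members) / n_k for j in range(d)]
        cov = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in members) / n_k
                for b in range(d)] for a in range(d)]
        stats[k] = {"n": n_k, "prior": n_k / n, "mean": mu, "cov": cov}
    return stats


def pooled_covariance(stats):
    """Weighted average of the class covariances (common covariance)."""
    total = sum(s["n"] for s in stats.values())
    d = len(next(iter(stats.values()))["mean"])
    return [[sum(s["n"] * s["cov"][a][b] for s in stats.values()) / total
             for b in range(d)] for a in range(d)]


# Made-up 2-D data: four points in class 1 and two points in class 2.
xs = [[0, 0], [2, 0], [0, 2], [2, 2], [4, 4], [6, 4]]
ys = [1, 1, 1, 1, 2, 2]
stats = class_estimates(xs, ys)
pooled = pooled_covariance(stats)
print(stats[1]["prior"], stats[1]["mean"], pooled)
```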
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,-\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, we can compute the squared distance between a point and each class center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. When the priors are equal, the class whose center is closest to <math>\,x</math> maximises <math>\,\delta_k</math>; otherwise, the class that minimises half the squared distance minus <math>\,\log(\pi_k)</math> does. According to the theorem, we then assign the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
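<br />
A minimal sketch of Case 1 in Python is given below. The class means, priors and test point are invented for illustration, and the function names <code>delta_identity_cov</code> and <code>classify_point</code> are hypothetical.<br />

```python
import math


def delta_identity_cov(x, mu, prior):
    """delta_k(x) when Sigma_k = I: -0.5 * squared distance + log prior."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    return -0.5 * sq_dist + math.log(prior)


def classify_point(x, means, priors):
    """Return the class label k that maximises delta_k(x)."""
    return max(means, key=lambda k: delta_identity_cov(x, means[k], priors[k]))


# Two made-up class centers; the test point is slightly closer to class 2.
means = {1: [0.0, 0.0], 2: [4.0, 0.0]}
x = [2.2, 0.0]
equal = classify_point(x, means, {1: 0.5, 2: 0.5})
skewed = classify_point(x, means, {1: 0.9, 2: 0.1})
print(equal, skewed)
```

With equal priors the nearest center wins, whereas a strong prior for class 1 is enough to move this particular point to class 1.<br />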
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, as a covariance matrix <math>\, \Sigma </math> is, we will have <math>\, U=V</math>.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
which is the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
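<br />
The identity behind this transformation can be checked numerically. The sketch below assumes that the decomposition <math>\,\Sigma = USU^\top</math> is already known: it builds one from an arbitrary rotation <math>\,U</math> and diagonal <math>\,S</math> rather than computing it from data. All numbers and helper names are made up for illustration.<br />

```python
import math


def rotation(theta):
    """2x2 rotation matrix; its columns are orthonormal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transpose(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def whiten(x, u, s_diag):
    """Apply x* = S^(-1/2) U^T x, with the diagonal of S given as a list."""
    z = matvec(transpose(u), x)
    return [z[i] / math.sqrt(s_diag[i]) for i in range(2)]


def mahalanobis_sq(x, mu, u, s_diag):
    """(x - mu)^T Sigma^(-1) (x - mu) with Sigma^(-1) = U S^(-1) U^T."""
    s_inv = [[1.0 / s_diag[0], 0.0], [0.0, 1.0 / s_diag[1]]]
    sigma_inv = matmul(matmul(u, s_inv), transpose(u))
    diff = [x[0] - mu[0], x[1] - mu[1]]
    w = matvec(sigma_inv, diff)
    return diff[0] * w[0] + diff[1] * w[1]


u = rotation(math.pi / 6)
s_diag = [4.0, 0.25]
x, mu = [3.0, -1.0], [1.0, 2.0]

x_star, mu_star = whiten(x, u, s_diag), whiten(mu, u, s_diag)
euclid_sq = sum((p - q) ** 2 for p, q in zip(x_star, mu_star))
print(abs(euclid_sq - mahalanobis_sq(x, mu, u, s_diag)) < 1e-9)
```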
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to assume ahead of time that a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we want to transform them to the same shape. Given a new data point, we must decide which class it belongs to, but which transformation should we apply to it? If, for example, we use the transformation of class A, then we have already assumed that the data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
QDA can fit the data better than LDA because it does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that each class-conditional distribution is Gaussian, which may not be the case in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
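<br />
The two counts are easy to tabulate. The following sketch simply evaluates the formulas above; the function names are made up for this example.<br />

```python
def lda_param_count(K, d):
    """(K - 1) linear boundaries, each with d + 1 parameters."""
    return (K - 1) * (d + 1)


def qda_param_count(K, d):
    """(K - 1) quadratic boundaries, each with d(d + 3)/2 + 1 parameters."""
    per_boundary = d * (d + 3) // 2 + 1  # d(d + 3) is always even
    return (K - 1) * per_boundary


# Two classes, for a few input dimensions d.
for d in (2, 10, 64):
    print(d, lda_param_count(2, d), qda_param_count(2, d))
```

For two classes in <math>\,d = 64</math> dimensions, this gives 65 parameters for LDA and 2145 for QDA.<br />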
<br />
== Trick: Using LDA to do QDA ==<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. Note that this is not the same as performing QDA: applying QDA directly to the same data would in general give a different decision boundary, so the approach is better described as a nonlinear extension of LDA. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of the original features. We then do LDA on our new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space corresponds to a quadratic boundary in the original feature space.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the number of free entries in a symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly. Here we take <math>v</math> to be a diagonal matrix with entries <math>v_1,\dots,v_d</math>, so that <math>x^Tvx = \sum_{i=1}^{d} v_i x_i^2</math>.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
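<br />
The augmentation step itself is simple. The sketch below maps each data point to <math>x^*</math> and evaluates <math>\underline{w}^{*T}x^*</math> in plain Python; the weights are arbitrary numbers chosen for illustration (they are not fitted by LDA), and the function names are hypothetical.<br />

```python
def augment_with_squares(rows):
    """Map each row [x_1, ..., x_d] to [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return [list(r) + [v * v for v in r] for r in rows]


def linear_score(w_star, x_star, bias=0.0):
    """Evaluate w*^T x* + bias, which is quadratic in the original x."""
    return sum(w * x for w, x in zip(w_star, x_star)) + bias


sample = [[1.0, -2.0], [0.5, 3.0]]
x_star = augment_with_squares(sample)

# Hypothetical weights [w_1, w_2, v_1, v_2]; in practice LDA would supply them.
w_star = [1.0, 0.0, -1.0, 2.0]
scores = [linear_score(w_star, row) for row in x_star]
print(x_star, scores)
```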
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
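<br />
:The arithmetic behind the empirical error rate can be sketched as follows. The labels here are a synthetic stand-in that merely reproduces the counts above (400 points, 31 misclassified); they are not the actual output of <code>classify</code>, and the function name is made up.<br />

```python
def empirical_error_rate(predicted, actual):
    """Fraction of points whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Synthetic labels: 200 points of class 1 followed by 200 points of class 2.
actual = [1] * 200 + [2] * 200
predicted = list(actual)
for i in range(31):
    predicted[i] = 2  # pretend that 31 points were misclassified

n_correct = sum(1 for p, a in zip(predicted, actual) if p == a)
rate = empirical_error_rate(predicted, actual)
print(n_correct, rate)
```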
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA correctly classifies only 2 more data points than LDA (371 versus 369); we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in Matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here, we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data from assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA with the SVD method and with princomp, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
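Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (possibly up to the sign of each column).<br />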
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well-defined separation between the two classes, allowing us to classify the data. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points within each class close to each other while keeping them far from the other classes.<br />
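<br />
The idea can be illustrated numerically. The following is a minimal sketch in Python (the notes themselves use Matlab and R), assuming the standard closed-form two-class solution <math>\textbf{w} \propto S_W^{-1}(\mu_1 - \mu_2)</math>, where <math>S_W</math> is the within-class scatter matrix. This formula has not been derived at this point in the notes, and the toy data and function names are made up for illustration.<br />

```python
# Sketch of the two-class Fisher discriminant direction in 2-D (pure Python).
# The toy data and all function names are illustrative assumptions.

def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, mu):
    """2x2 scatter matrix of one class: sum of (x - mu)(x - mu)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class1, class2):
    """Return w = S_W^{-1} (mu1 - mu2) for two classes of 2-D points."""
    mu1, mu2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, mu1), scatter(class2, mu2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.5)]
class2 = [(6.0, 5.0), (7.0, 6.5), (8.0, 7.0), (7.0, 5.5)]

w = fisher_direction(class1, class2)
# One-dimensional projections of each class onto the direction w.
proj1 = [w[0] * x + w[1] * y for x, y in class1]
proj2 = [w[0] * x + w[1] * y for x, y in class2]
print(w, proj1, proj2)
```

For this toy data the projected values of the two classes do not overlap, so a single threshold on the projection line separates them.<br />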
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
>> library(MASS)  # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
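: Calculate the singular value decomposition of the centered X. The column means are subtracted first because PCA describes variation about the mean. The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA).<br />
>> s <- svd(scale(X, center=TRUE, scale=FALSE), nu=1, nv=1)<br />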
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
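<br />
The equivalence mentioned above can be checked on a small example. The sketch below (Python, standard library only) assumes a squared Mahalanobis-type distance <math>(x-y)^T A (x-y)</math> with <math>A = L^T L</math>; the matrix <code>L</code> and the two points are hypothetical values chosen for illustration, not taken from the paper.<br />

```python
# Sketch: a squared Mahalanobis-type distance with A = L^T L equals the
# squared Euclidean distance after mapping the points through L.
# The matrix L and the points x, y are hypothetical illustration values.

def transpose(m):
    return [list(row) for row in zip(*m)]

def mat_vec(m, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def mat_mul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def quadratic_form(d, a):
    """Compute d^T A d."""
    ad = mat_vec(a, d)
    return sum(di * adi for di, adi in zip(d, ad))

L = [[2.0, 0.0], [1.0, 1.0]]
A = mat_mul(transpose(L), L)       # positive semidefinite by construction

x, y = [1.0, 2.0], [3.0, -1.0]
diff = [xi - yi for xi, yi in zip(x, y)]

# Distance computed directly in the input space with the metric A.
d_mahalanobis = quadratic_form(diff, A)

# Distance computed as plain squared Euclidean distance after applying L.
lx, ly = mat_vec(L, x), mat_vec(L, y)
d_euclidean = sum((p - q) ** 2 for p, q in zip(lx, ly))

print(d_mahalanobis, d_euclidean)
```

Learning the metric <math>A</math> is therefore the same as learning the linear map <math>L</math>, which is why these methods relate naturally to projection methods such as FDA.<br />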
<br />
Some of the proposed algorithms are iterative and computationally expensive. The paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, proposes a closed-form solution to one algorithm that previously required expensive semidefinite optimization. It provides a new problem setup in which the algorithm performs as well as or better than some standard methods, without the computational complexity. Furthermore, the paper demonstrates a strong relationship between these methods and Fisher Discriminant Analysis (FDA). It also extends the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for extracting the main directions of variation in data. It has applications in data visualization, data mining, and reducing the dimensionality of a data set, among others. It is mostly used for data analysis and for building predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in the data<br />
* Preserving the information stored in the data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
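<br />
The approximation <math>\hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2</math> can be sketched on a tiny example. In the Python snippet below, the mean <code>x_bar</code>, the directions <code>v1</code> and <code>v2</code>, and the vector <code>x</code> are hypothetical; the two directions are orthonormal but were not computed by PCA, so this only illustrates how the two coefficients are obtained and used.<br />

```python
# Sketch: approximate a vector by its mean plus two orthonormal directions.
# x_bar, v1, v2 and x are hypothetical values; v1 and v2 are orthonormal
# but were NOT obtained from an actual PCA.

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x_bar = [1.0, 1.0, 1.0]
v1 = [1.0, 0.0, 0.0]
v2 = [0.0, 0.6, 0.8]
x = [3.0, 2.0, 4.0]

centered = [xi - mi for xi, mi in zip(x, x_bar)]

# The two coefficients are the projections of the centered vector.
lam1 = dot(v1, centered)
lam2 = dot(v2, centered)

# Approximation x_hat = x_bar + lam1 * v1 + lam2 * v2.
x_hat = [m + lam1 * a + lam2 * b for m, a, b in zip(x_bar, v1, v2)]

# Whatever is lost lies outside the span of v1 and v2.
residual = [xi - hi for xi, hi in zip(x, x_hat)]
print(lam1, lam2, x_hat, residual)
```

Here a 3-dimensional vector is stored as 2 coefficients; for the handwritten threes the same idea stores a 644-pixel image as 2 coefficients.<br />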
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by the sample covariance matrix <math>s</math>, so that<br />
<br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br />
is the variance of the projection <math>\displaystyle u </math> defined by the weight vector <math>\displaystyle \textbf{w} </math>. The first principal component is the vector that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
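<br />
As a sanity check, the substitution step can be carried out numerically. The short Python sketch below evaluates <math>f</math> at the two stationary points and compares the result with a brute-force scan of the unit circle; the scan resolution (3600 angles) is an arbitrary choice made for illustration.<br />

```python
import math

def f(x, y):
    return x - y

# The two stationary points found with the Lagrange multiplier method.
r = math.sqrt(2) / 2
stationary_points = [(r, -r), (-r, r)]

# Substitute each point into f and keep the one with the larger value.
values = [f(x, y) for x, y in stationary_points]
best_point = stationary_points[values.index(max(values))]

# Brute-force check: evaluate f on a grid of points satisfying x^2 + y^2 = 1.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
grid_max = max(f(math.cos(t), math.sin(t)) for t in angles)

print(best_point, max(values), grid_max)
```

No point on the scanned circle exceeds the value at the first stationary point, which agrees with the conclusion above.<br />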
<br />
====Determining '''W''' ====<br />
Back to the original problem, we form the Lagrangian (writing <math>S</math> for the sample covariance matrix, denoted <math>s</math> above),<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is nonzero. Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the <math>2S\textbf{w}</math> term below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
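<br />
The result can be illustrated on a small covariance matrix. The Python sketch below uses power iteration, which is a technique chosen here for illustration and not the method used in the lecture, to find the eigenvector with the largest eigenvalue of a hypothetical 2 by 2 matrix <code>S</code>, and compares the variance <math>\textbf{w}^T S \textbf{w}</math> with the closed-form largest eigenvalue of a symmetric 2 by 2 matrix.<br />

```python
import math

def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def normalize(v):
    n = math.sqrt(v[0] ** 2 + v[1] ** 2)
    return [v[0] / n, v[1] / n]

def first_principal_component(s, iterations=200):
    """Power iteration: repeatedly apply S and renormalize to unit length."""
    w = normalize([1.0, 0.0])
    for _ in range(iterations):
        w = normalize(mat_vec(s, w))
    sw = mat_vec(s, w)
    variance = w[0] * sw[0] + w[1] * sw[1]   # w^T S w
    return w, variance

# Hypothetical sample covariance matrix (symmetric, positive definite).
S = [[3.0, 1.0], [1.0, 2.0]]
w_star, lam_star = first_principal_component(S)

# Closed-form largest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, c]].
a, b, c = S[0][0], S[0][1], S[1][1]
lam_closed = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

print(w_star, lam_star, lam_closed)
```

The variance along <code>w_star</code> matches the largest eigenvalue, as the derivation predicts.<br />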
<br />
<br />
<math>D</math>-dimensional data have <math>D</math> eigenvectors, whose eigenvalues can be ordered as<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
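<br />
This decomposition can be verified on a small example. The Python sketch below computes the sample covariance matrix of four hypothetical 2-dimensional points and checks that the two eigenvalues (from the closed form for a symmetric 2 by 2 matrix) add up to the sum of the coordinate variances.<br />

```python
import math

# Four hypothetical 2-D data points, used only for illustration.
data = [(2.0, 1.0), (3.0, 4.0), (5.0, 3.0), (6.0, 6.0)]
n = len(data)

mean_x = sum(p[0] for p in data) / n
mean_y = sum(p[1] for p in data) / n

# Entries of the sample covariance matrix S (divide by n - 1).
var_x = sum((p[0] - mean_x) ** 2 for p in data) / (n - 1)
var_y = sum((p[1] - mean_y) ** 2 for p in data) / (n - 1)
cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in data) / (n - 1)

# Eigenvalues of the symmetric matrix [[var_x, cov_xy], [cov_xy, var_y]].
half_trace = (var_x + var_y) / 2
radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

total_variance = var_x + var_y   # Tr(S)
print(lam1, lam2, lam1 + lam2, total_variance)
```

The first eigenvalue accounts for most of the total variance here, which is the situation in which keeping only the first principal component loses little information.<br />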
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare the different types of data) or cluster it (i.e. group the unlabeled data and compare the resulting groups).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y = U^T X </math>, where <math>U</math> is the <math>D \times d</math> matrix of the top <math>d</math> principal components. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X} = UY </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6426stat841f102010-10-02T21:24:57Z<p>Hclam: /* Distance Metric Learning VS FDA */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views. Some people call supervised learning "classification" and unsupervised learning "clustering", while others call clustering "unsupervised classification", as in http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other sources that can be found by a Google search. I think it would be good to clarify this here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature vector of the <math>\,i</math>-th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, is a ''d''-dimensional vector and the label of the <math>\,i</math>-th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature vector is <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function <math>\,I(h(X_i) \neq Y_i) = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,i</math>-th training input, respectively.<br />
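<br />
As a minimal sketch of the empirical error rate defined above, the following Python snippet counts the misclassified training pairs and divides by <math>\,n</math>. The threshold rule <code>h</code> and the five training points are hypothetical values chosen only for illustration.<br />
<br />
```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X[i], Y[i]) that the rule h misclassifies."""
    if len(X) != len(Y) or not X:
        raise ValueError("X and Y must be non-empty and of equal length")
    # Each term of the sum is the indicator I(h(X_i) != Y_i).
    mistakes = sum(1 for x_i, y_i in zip(X, Y) if h(x_i) != y_i)
    return mistakes / len(X)


# Hypothetical rule: call a fruit an orange when its diameter exceeds 7.5 cm.
def h(x):
    return "orange" if x > 7.5 else "apple"


# Hypothetical training set (diameters in cm and true labels).
X_train = [6.8, 7.1, 8.0, 8.4, 7.7]
Y_train = ["apple", "apple", "orange", "orange", "apple"]

error = empirical_error_rate(h, X_train, Y_train)
print(error)  # one of the five fruits (the 7.7 cm apple) is misclassified
```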
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their simple design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
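<br />
The derivation above can be sketched in a few lines of Python. The function below takes the likelihoods <math>\,P(X=x|Y=y)</math> and the priors <math>\,P(Y=y)</math> for every class and returns the posteriors; the function name and the numbers are assumptions made for illustration only.<br />
<br />
```python
def posteriors(likelihoods, priors):
    """Return P(Y=y | X=x) for every class y via the Bayes formula.

    likelihoods[y] holds P(X=x | Y=y) and priors[y] holds P(Y=y).
    """
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())  # P(X=x), the denominator
    if evidence == 0:
        raise ValueError("x has zero probability under every class")
    return {y: joint[y] / evidence for y in joint}


# Made-up likelihoods and priors for a two-class fruit problem.
post = posteriors({"apple": 0.6, "orange": 0.1},
                  {"apple": 0.5, "orange": 0.5})
print(post)
```
<br />
Because every posterior shares the same evidence term, the posteriors always sum to one.<br />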
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P(Y=1|X=x)</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input the label <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it attains the least possible probability of misclassification, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the components that make up the Bayes classifier's model, such as the likelihoods and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate considerably from their true population values, and this can ultimately cause the estimated posterior probabilities to deviate considerably from the true ones. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
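<br />
Approach 3 can be sketched as follows for a single real-valued feature. The text does not say how the class-conditional densities are estimated; this sketch assumes, purely for illustration, that each one is a one-dimensional Gaussian fitted by maximum likelihood. All function names and data values are hypothetical.<br />
<br />
```python
import math


def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def train(X, Y):
    """Estimate P(X=x|Y=0), P(X=x|Y=1) and P(Y=1) from labelled data."""
    params0 = fit_gaussian([x for x, y in zip(X, Y) if y == 0])
    params1 = fit_gaussian([x for x, y in zip(X, Y) if y == 1])
    prior1 = sum(Y) / len(Y)  # (1/n) * sum of the Y_i
    return params0, params1, prior1


def r_hat(x, model):
    """Estimated posterior P(Y=1 | X=x)."""
    params0, params1, prior1 = model
    num = gaussian_pdf(x, *params1) * prior1
    den = num + gaussian_pdf(x, *params0) * (1 - prior1)
    return num / den


def h(x, model):
    return 1 if r_hat(x, model) > 0.5 else 0


# Hypothetical training data: class 0 clusters near 2, class 1 near 7.
X = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]
Y = [0, 0, 0, 0, 1, 1, 1, 1]
model = train(X, Y)
print(h(2.0, model), h(7.0, model))
```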
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
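<br />
A short sketch of this rule in Python: since the evidence <math>\,P(X=x)</math> is the same for every class, maximizing the posterior is equivalent to maximizing <math>\,f_i(x)\pi_i</math>. The function name and the numbers below are made up for illustration.<br />
<br />
```python
def bayes_classify(likelihoods, priors):
    """Return arg max_i f_i(x) * pi_i, which equals arg max_i P(Y=i | X=x)."""
    scores = {i: likelihoods[i] * priors[i] for i in priors}
    return max(scores, key=scores.get)


# Hypothetical values of f_i(x) and pi_i for k = 3 classes.
label = bayes_classify({1: 0.10, 2: 0.30, 3: 0.20},
                       {1: 0.6, 2: 0.3, 3: 0.1})
print(label)
```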
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes, <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether this student would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
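<br />
The arithmetic of this example can be checked with a few lines of Python. The two likelihood values, 0.05 and 0.2, are the ones that appear in the calculation above; the variable names are only for illustration.<br />
<br />
```python
# Likelihoods of the new student's features x = (0, 1, 0), as used in the text.
likelihood_pass = 0.05   # P(X=(0,1,0) | Y=1)
likelihood_fail = 0.20   # P(X=(0,1,0) | Y=0)
prior_pass = prior_fail = 0.5

numerator = likelihood_pass * prior_pass
evidence = numerator + likelihood_fail * prior_fail
r_hat = numerator / evidence

# Bayes classification rule: predict "pass" only when r_hat exceeds 1/2.
prediction = 1 if r_hat > 0.5 else 0
print(round(r_hat, 3), prediction)
```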
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable the occurrence of E is prior to knowing anything about any other event whose occurrence could have an impact on E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability of E's occurrence. In practice, however, E's prior probability is unknown, and therefore it must either be guessed or be estimated using a sample of available data. After obtaining a guessed or estimated value of E's prior probability, the Bayesian view holds that the probability, that is, the believability, of E's occurrence can always be made more accurate should any new information regarding events that are relevant to E become available. The Bayesian view also holds that the more useful information is available about events that are relevant to E, the more accurate the estimate of the probability of E's occurrence becomes. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there will be rain tomorrow, because one cannot carry out repeated trials on an event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
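<br />
The frequentist approximation <math>P(x) \approx \frac{n_x}{n_t}</math> can be illustrated with a small simulation. The sketch below assumes a hypothetical event whose intrinsic probability is 0.3 and uses Python's standard pseudo-random generator with a fixed seed; it is an illustration of the idea, not a statement about any real experiment.<br />
<br />
```python
import random


def relative_frequency(p_true, n_trials, seed=0):
    """Simulate n_trials independent trials and return n_x / n_t."""
    rng = random.Random(seed)
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_x / n_trials


# Hypothetical intrinsic probability of 0.3, estimated from more and more trials.
estimates = {n_t: relative_frequency(0.3, n_t) for n_t in (100, 10_000, 100_000)}
for n_t, estimate in estimates.items():
    print(n_t, estimate)
```
<br />
With more trials the relative frequency typically settles closer to the assumed value, although any finite run still gives only an approximation.<br />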
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math>, where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
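<br />
To make the boundary concrete, the sketch below computes ''a'' and ''b'' from the formulas above for a two-dimensional example, using only the Python standard library. The helper functions, the means, the covariance matrix and the priors are all hypothetical choices for illustration.<br />
<br />
```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as [[p, q], [r, s]]."""
    (p, q), (r, s) = m
    det = p * s - q * r
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the decision boundary is a . x + b = 0."""
    sigma_inv = inv2(sigma)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    a = [-2 * c for c in matvec(sigma_inv, diff)]
    b = (-2 * math.log(pi1 / pi0)
         + dot(mu1, matvec(sigma_inv, mu1))
         - dot(mu0, matvec(sigma_inv, mu0)))
    return a, b


# Hypothetical class means, shared covariance matrix and equal priors.
mu1, mu0 = [3.0, 1.0], [1.0, 1.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
a, b = lda_boundary(mu1, mu0, sigma, 0.5, 0.5)

# With equal priors the midpoint of the two means lies on the boundary.
midpoint = [(mu1[0] + mu0[0]) / 2, (mu1[1] + mu0[1]) / 2]
value_at_midpoint = dot(a, midpoint) + b
print(a, b, value_at_midpoint)
```
<br />
Since the expression is <math>\,-2</math> times the log posterior ratio, a negative value of <math>\,a^\top x+b</math> corresponds to class 1 being the more probable class.<br />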
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. One then obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. If classes <math>\,m </math> and <math>\,n</math> contain the same number of training points, so that their estimated priors are equal, the <math>\,\log</math> term vanishes and the resulting decision boundary passes through the point exactly halfway between the two class means <math>\,\mu_m</math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that calculating the class posteriors <math>\Pr(Y=k|X=x)</math> gives optimal classification. He also shows that, by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \ \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic discriminant analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance matrices of the classes are identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier's rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the index <math>\,k</math>, i.e., the class, for which <math>\,\delta_k(x)</math> attains its largest value.<br />
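<br />
The quadratic discriminant from the theorem can be sketched in Python as follows. To stay self-contained the sketch is restricted to two-dimensional features, so that the determinant and inverse of <math>\,\Sigma_k</math> can be written out by hand; the class parameters are hypothetical.<br />
<br />
```python
import math


def qda_discriminant(x, mu, sigma, prior):
    """delta_k(x) for a 2-D Gaussian class with mean mu and covariance sigma."""
    (p, q), (r, s) = sigma
    det = p * s - q * r
    inv = [[s / det, -q / det], [-r / det, p / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    maha = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)


def classify(x, classes):
    """classes maps label k -> (mu_k, Sigma_k, pi_k); returns arg max_k delta_k(x)."""
    return max(classes, key=lambda k: qda_discriminant(x, *classes[k]))


# Hypothetical parameters for two classes with different covariance matrices.
classes = {
    1: ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    2: ([4.0, 4.0], [[2.0, 0.3], [0.3, 0.5]], 0.5),
}
label_near_first = classify([0.5, -0.2], classes)
label_near_second = classify([3.8, 4.1], classes)
print(label_near_first, label_near_second)
```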
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When a common covariance matrix is assumed, as in LDA, its maximum likelihood estimate is the average of the per-class sample covariances, weighted by the class sizes:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{l=1}^{k}n_l} </math><br />
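<br />
These estimates can be sketched in plain Python as follows. The function name and the tiny data set are hypothetical; the per-class covariances divide by <math>\,n_k</math>, matching the formulas above.<br />
<br />
```python
def estimate_parameters(X, y):
    """Return priors, means, per-class covariances and the pooled covariance."""
    n, d = len(X), len(X[0])
    priors, means, covs = {}, {}, {}
    pooled = [[0.0] * d for _ in range(d)]
    for k in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == k]
        n_k = len(rows)
        mu = [sum(r[j] for r in rows) / n_k for j in range(d)]
        cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n_k
                for j in range(d)] for i in range(d)]
        priors[k], means[k], covs[k] = n_k / n, mu, cov
        # Pooled covariance: sum of n_k * Sigma_k divided by the total n.
        for i in range(d):
            for j in range(d):
                pooled[i][j] += n_k * cov[i][j] / n
    return priors, means, covs, pooled


# Hypothetical two-class data set with two features.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [6.0, 5.0], [7.0, 7.0]]
y = [0, 0, 0, 1, 1]
priors, means, covs, pooled = estimate_parameters(X, y)
print(priors, means, pooled)
```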
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data in each class are distributed symmetrically around the class center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, we can find the squared distance between a point and each center and adjust it by the log of the prior, <math>\,\log(\pi_k)</math>. The class whose center is closest after this adjustment maximises <math>\,\delta_k</math>, and, according to the theorem, the point is assigned to that class <math>\,k</math>. When all priors are equal, the point is simply assigned to the class with the nearest center. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
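<br />
Case 1 amounts to a nearest-center rule with a prior adjustment, which can be sketched as follows. The function name, the class centers and the priors are hypothetical.<br />
<br />
```python
import math


def classify_identity_cov(x, means, priors):
    """Return arg max_k of -0.5 * ||x - mu_k||^2 + log(pi_k)."""
    def delta(k):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, means[k]))
        return -0.5 * sq_dist + math.log(priors[k])
    return max(means, key=delta)


# Hypothetical class centers; the query point is slightly closer to "b".
means = {"a": [0.0, 0.0], "b": [2.0, 0.0]}
point = [1.1, 0.0]

equal = classify_identity_cov(point, means, {"a": 0.5, "b": 0.5})
skewed = classify_identity_cov(point, means, {"a": 0.9, "b": 0.1})
print(equal, skewed)
```
<br />
With equal priors the nearer center wins, whereas a sufficiently large prior for the other class can outweigh a small difference in distance.<br />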
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semidefinite, as a covariance matrix is, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is such a matrix.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
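<br />
The identity behind this transformation can be checked numerically. The sketch below is only an illustration under assumed values: it uses the covariance <math>\,\Sigma = \left( \begin{array}{cc} 2 & 1 \\ 1 & 2 \end{array} \right)</math>, whose eigenvalues (3 and 1) and eigenvectors are known in closed form, and confirms that <math>\,(x-\mu)^\top\Sigma^{-1}(x-\mu)</math> equals the squared Euclidean distance between the transformed points. The helper names and the test point are hypothetical.<br />

```python
import math

# Assumed example: Sigma = [[2, 1], [1, 2]] = U S U^T, with eigenvalues
# S = diag(3, 1) and the orthonormal eigenvectors stored as the columns of U.
r = 1.0 / math.sqrt(2.0)
U = [[r, r],
     [r, -r]]
S = [3.0, 1.0]
SIGMA_INV = [[2.0 / 3.0, -1.0 / 3.0],
             [-1.0 / 3.0, 2.0 / 3.0]]

def whiten(x):
    """Return x* = S^(-1/2) U^T x."""
    ut_x = [U[0][j] * x[0] + U[1][j] * x[1] for j in range(2)]
    return [ut_x[j] / math.sqrt(S[j]) for j in range(2)]

def mahalanobis_sq(x, mu):
    """(x - mu)^T Sigma^(-1) (x - mu) in the original space."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    return sum(d[i] * SIGMA_INV[i][j] * d[j] for i in range(2) for j in range(2))

def euclidean_sq(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

x, mu = [1.0, 3.0], [0.5, -1.0]
lhs = mahalanobis_sq(x, mu)                # distance measured with Sigma^(-1)
rhs = euclidean_sq(whiten(x), whiten(mu))  # plain distance after the transformation
print(lhs, rhs)                            # the two values agree up to rounding
```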
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape (the same covariance matrix) for this method to be applicable, so it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and want to transform them to the same shape. Given a new data point, we must decide which class it belongs to, but which transformation should we apply to it? If we use the transformation of class A, for example, then we have already assumed that the point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make LDA's assumption that the covariance matrix is identical for every class. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold in practice. Kernel QDA drops the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more parameters we have to estimate, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
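<br />
The two counts above are easy to tabulate. The following sketch simply evaluates the formulas <math>\,(K-1)(d+1)</math> and <math>\,(K-1)(\frac{d(d+3)}{2}+1)</math>; the function names are hypothetical, and the value <math>\,d=64</math> is chosen only because it is the dimension of the 2_3 data used later on this page.<br />

```python
def lda_num_params(K, d):
    """(K-1) linear boundaries a^T x + b, each with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_num_params(K, d):
    """(K-1) quadratic boundaries x^T a x + b^T x + c, each with d(d+3)/2 + 1 parameters."""
    # d * (d + 3) is always even, so integer division is exact.
    return (K - 1) * (d * (d + 3) // 2 + 1)

for d in (2, 10, 64):
    print(d, lda_num_params(2, d), qda_num_params(2, d))
```

For two classes in 64 dimensions this gives 65 parameters for LDA against 2145 for QDA, which is the gap the plot illustrates.<br />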
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are functions of our original features, such as their squares. We then do LDA on the new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space can then be written in terms of the original features, where it is quadratic.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the free entries of a symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
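where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. Here we take <math>\,v</math> to be a diagonal matrix with diagonal entries <math>\,v_1,\ldots,v_d</math>, so that <math>\,x^Tvx = \sum_{i=1}^d v_i x_i^2</math>.<br />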
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the original entries squared element-wise, to a cubic dimension with the entries cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
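<br />
The construction of <math>\,x^*</math> and <math>\,\underline{w}^*</math> can be sketched in a few lines of Python. This is an illustration under the assumption that only squared terms are added (no cross terms such as <math>\,x_1x_2</math>); the helper names and the numeric values of <math>\,w</math>, <math>\,v</math> and <math>\,x</math> are made up.<br />

```python
def augment_with_squares(x):
    """Map x = [x_1, ..., x_d] to x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical coefficients: w for the linear part, v for the squared terms.
w = [1.0, -2.0]
v = [0.5, 3.0]
x = [2.0, -1.0]

# Quadratic function in the original 2-dimensional space.
quadratic_value = sum(vi * xi ** 2 for vi, xi in zip(v, x)) + dot(w, x)

# The same value as a linear function w*^T x* in the augmented 4-dimensional space.
w_star = w + v
x_star = augment_with_squares(x)
linear_value = dot(w_star, x_star)

print(quadratic_value, linear_value)
```

Because the function is linear in <math>\,x^*</math>, any linear method such as LDA can estimate <math>\,\underline{w}^*</math> from the augmented data.<br />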
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
  >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
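<br />
The arithmetic behind the empirical error rate is the fraction of points whose predicted label differs from the true label. The sketch below reproduces the figure 0.0775 with made-up label vectors that contain 31 mistakes out of 400; it does not use the actual output of <code>classify</code>, and the function name is hypothetical.<br />

```python
def empirical_error_rate(predicted, actual):
    """Fraction of points whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label vectors must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

# Stand-in for the group vector: 200 points of class 1, then 200 of class 2.
actual = [1] * 200 + [2] * 200

# Stand-in for the classifier output: flip 31 labels so that 369 are correct.
predicted = list(actual)
for i in range(31):
    predicted[i] = 2

print(empirical_error_rate(predicted, actual))  # 31 / 400
```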
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
  >> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: an analysis of the <code>princomp</code> function in Matlab.'''<br />
<br />In assignment 1, we learnt how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
  %   returns the principal components in PC, the so-called Z-scores in SCORES,<br />
  %   the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
  [m,n] = size(x);   % number of rows (m) and columns (n) of matrix x<br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
  if (r < n)<br />
    latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using the SVD method and princomp respectively, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (up to possible sign differences in individual columns).<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification.<br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data. In practice, it is generally not possible to collapse all data points in one class to a single point. We instead try to make the data points within each class close to each other while keeping them far from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
  >> library(MASS)  # provides mvrnorm and lda<br />
  >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
: Calculate the singular value decomposition of the centered X (PCA requires subtracting the column means first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA)<br />
  >> s <- svd(scale(X, center=TRUE, scale=FALSE), nu=1, nv=1)<br />
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs as well as or better than some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and the Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for analysing high-dimensional data. It has applications in data visualization, data mining, and reducing the dimensionality of a data set, among others. It is mostly used for exploratory data analysis and as a preprocessing step for predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keeping two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slend of each digit handwritten.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> determined by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
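<br />
Before solving the problem analytically, it can help to see the constrained maximum numerically. The sketch below is only an illustration: it assumes a made-up <math>2 \times 2</math> sample covariance matrix, parametrizes unit vectors as <math>\textbf{w} = (\cos\theta, \sin\theta)^T</math>, and scans <math>\theta</math> for the largest value of <math>\textbf{w}^T s \textbf{w}</math>. It is a brute-force check, not the method used in the lecture.<br />

```python
import math

# Assumed sample covariance matrix (symmetric, eigenvalues 3 and 1).
s = [[2.0, 1.0],
     [1.0, 2.0]]

def projected_variance(theta):
    """w^T s w for the unit vector w = (cos(theta), sin(theta))."""
    w = (math.cos(theta), math.sin(theta))
    return sum(w[i] * s[i][j] * w[j] for i in range(2) for j in range(2))

# Scan directions in [0, pi); w and -w describe the same direction.
n = 3600
angles = [math.pi * k / n for k in range(n)]
best_theta = max(angles, key=projected_variance)

print(best_theta, projected_variance(best_theta))
```

For this matrix <math>\textbf{w}^T s \textbf{w} = 2 + \sin 2\theta</math>, so the scan settles near <math>\theta = \pi/4</math> with variance close to 3, which is also the largest eigenvalue of <math>s</math>.<br />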
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y)=0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which gives the larger value: <math>\displaystyle f = \sqrt{2}</math> at the first point and <math>\displaystyle f = -\sqrt{2}</math> at the second. The maximum is therefore at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
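<br />
The example can also be checked numerically. The following is a minimal Python sketch, not part of the lecture: it solves the stationarity conditions of the Lagrangian in closed form and compares the result with a brute-force scan of the unit circle. The names <code>stationary_points</code>, <code>best_on_circle</code> and <code>best_stationary</code> are chosen here for illustration only.<br />

```python
import math

def f(x, y):
    """Objective from the example: f(x, y) = x - y."""
    return x - y

# Stationarity of L = x - y - lam*(x^2 + y^2 - 1) gives
#   1 - 2*lam*x = 0,  -1 - 2*lam*y = 0,  x^2 + y^2 = 1,
# so x = 1/(2*lam), y = -1/(2*lam) and lam = +/- sqrt(2)/2.
stationary_points = []
for lam in (math.sqrt(2) / 2, -math.sqrt(2) / 2):
    x = 1 / (2 * lam)
    y = -1 / (2 * lam)
    stationary_points.append((x, y, lam))

# Brute-force check: scan the unit circle and keep the angle with the largest f.
best_angle = max((2 * math.pi * k / 100000 for k in range(100000)),
                 key=lambda t: f(math.cos(t), math.sin(t)))
best_on_circle = (math.cos(best_angle), math.sin(best_angle))

# Among the stationary points, the maximum is the one with the larger f.
best_stationary = max(stationary_points, key=lambda p: f(p[0], p[1]))
```

The scan only samples the constraint set, so it confirms the location of the maximum approximately; the closed-form solution is the exact one.<br />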
<br />
====Determining '''W''' ====<br />
Returning to the original problem, we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is nonzero. Setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = 0</math> recovers the constraint, so every stationary point of <math>\displaystyle L(\textbf{w},\lambda)</math> satisfies <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''' with symmetric '''S''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From this eigenvalue equation, <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue. If we substitute <math>\displaystyle\textbf{w}^*</math> into <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
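<br />
As an illustration of the result that the maximizing direction is the eigenvector with the largest eigenvalue, here is a hedged Python sketch using power iteration. Power iteration is a stand-in chosen for this sketch (the class example uses Matlab's <code>svd</code> instead); it assumes '''S''' is a covariance matrix whose largest eigenvalue is strictly larger than the others. The matrix <code>S</code> and the function names are hypothetical.<br />

```python
import math

def mat_vec(S, w):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(s_ij * w_j for s_ij, w_j in zip(row, w)) for row in S]

def normalize(w):
    """Scale w to unit length, enforcing the constraint w^T w = 1."""
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    return [w_i / norm for w_i in w]

def top_eigenpair(S, iterations=200):
    """Power iteration: approximate the largest eigenvalue of S and its eigenvector."""
    w = normalize([1.0] + [0.5] * (len(S) - 1))  # arbitrary nonzero starting vector
    for _ in range(iterations):
        w = normalize(mat_vec(S, w))
    # The Rayleigh quotient w^T S w equals the eigenvalue when w is a unit eigenvector.
    lam = sum(w_i * sw_i for w_i, sw_i in zip(w, mat_vec(S, w)))
    return lam, w

# Hypothetical 2x2 covariance matrix with eigenvalues 3 and 1.
S = [[2.0, 1.0],
     [1.0, 2.0]]
lam, w = top_eigenpair(S)
```

For this matrix the returned direction is proportional to (1, 1), and the returned value of '''w'''<sup>T</sup>'''Sw''' is the variance captured along that direction.<br />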
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
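<br />
The variance decomposition can be verified on a small case. The sketch below is an illustration only: it uses the closed-form eigenvalues of a symmetric 2×2 matrix and a hypothetical covariance matrix <code>S</code>, and checks that the eigenvalues add up to the trace.<br />

```python
import math

def eigenvalues_2x2_symmetric(S):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, largest first."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius

# Hypothetical covariance matrix: Var(x_1) = 4, Var(x_2) = 1, Cov(x_1, x_2) = 1.5.
S = [[4.0, 1.5],
     [1.5, 1.0]]
lam1, lam2 = eigenvalues_2x2_symmetric(S)
trace = S[0][0] + S[1][1]                  # Var(x_1) + Var(x_2)
explained_by_first = lam1 / (lam1 + lam2)  # share of total variance along u_1
```

The ratio <code>explained_by_first</code> is one common way to judge how many components to keep, although, as noted in the class example, choosing that number is difficult in general.<br />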
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                  % load the data; each column of X is one 20x28 image<br />
 who<br />
 size(X)<br />
 imagesc(reshape(X(:,1),20,28)')             % display the first noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';   % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)')       % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first shows the noisy data, in which we can barely make out the face.<br />
<br />
The second is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished in the picture of the de-noised face than in the picture of the noisy face. If we took more principal components, at first the image would improve, since the intrinsic dimensionality is probably more than 10. But if we included all the components we would get the noisy image back, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One application of PCA is to group similar data (e.g. images). There are generally two ways to do this: we can classify the data (i.e. use labeled data to assign each data point to a known class) or cluster the data (i.e. group unlabeled data points by similarity).<br />
<br />
In principle we can do this with the entire data set (e.g. for an 8×8 picture we can use all 64 pixels). However, this is hard, and it is easier to work with the reduced data, i.e. a small number of features extracted from the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
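<br />
The encode and reconstruct steps of PCA can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a transcription of the slide: the data matrix is stored as ''D'' rows of ''n'' values, the matrix <code>U</code> of principal directions is supplied by hand rather than computed, and all names and numbers are hypothetical.<br />

```python
import math

def center(X):
    """Subtract each feature's mean; X is D x n, stored as D rows of n values."""
    centered = []
    for row in X:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered

def encode(U, X):
    """Y = U^T X, where U is D x d with orthonormal columns and X is D x n."""
    D, d, n = len(U), len(U[0]), len(X[0])
    return [[sum(U[k][j] * X[k][i] for k in range(D)) for i in range(n)]
            for j in range(d)]

def reconstruct(U, Y):
    """X_hat = U Y, mapping the d x n codes back to D x n."""
    D, d, n = len(U), len(U[0]), len(Y[0])
    return [[sum(U[k][j] * Y[j][i] for j in range(d)) for i in range(n)]
            for k in range(D)]

# Hypothetical data: D = 2 features, n = 4 points lying near the line x_2 = x_1.
X = center([[1.0, 2.0, 3.0, 4.0],
            [1.1, 1.9, 3.2, 3.8]])
# Assumed top principal direction, given here rather than computed.
U = [[1 / math.sqrt(2)],
     [1 / math.sqrt(2)]]
Y = encode(U, X)            # 1 x 4 codes
X_hat = reconstruct(U, Y)   # 2 x 4 reconstruction from d = 1 component
```

With ''d'' = 1 every reconstructed point lies on the line spanned by the single direction in <code>U</code>; the part of each point orthogonal to that line is what the reconstruction discards.<br />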
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6425stat841f102010-10-02T21:24:03Z<p>Hclam: /* Fisher's Discriminant Analysis */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification - 2010.09.21''' ==<br />
<br />
=== Classification ===<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views. Some people call supervised learning "classification" and unsupervised learning "clustering", while others call clustering "unsupervised classification", as in http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other sources that can be found with a Google search. It would be good to clarify this here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.or/wiki/Regression_analysis clustering], and [http://en.wikipedia.org/wiki/Regression_analysis dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there are usually too many data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the <math>\,i</math>th training input's feature values <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math> form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to note why we define and use the empirical error rate instead of the true error rate. The main reason is that in experimental tasks we cannot measure the true error rate, so we estimate it by the empirical error rate, which is an unbiased estimate of the true error rate when the rule is fixed independently of the data used to evaluate it. }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
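<br />
The empirical error rate is simply the fraction of training pairs that the rule gets wrong. A minimal Python sketch follows; the threshold rule <code>h</code> and the small training set are hypothetical and serve only to exercise the formula.<br />

```python
def empirical_error_rate(h, inputs, labels):
    """Fraction of training pairs (X_i, Y_i) with h(X_i) != Y_i."""
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(labels)

# Hypothetical one-feature rule: predict class 1 when the feature exceeds 0.5.
def h(x):
    return 1 if x > 0.5 else 0

inputs = [0.1, 0.4, 0.6, 0.9, 0.7]
labels = [0, 1, 1, 1, 0]
error = empirical_error_rate(h, inputs, labels)  # 2 of the 5 inputs are misclassified
```

Here the second and fifth inputs are misclassified, so the empirical error rate is 2/5.<br />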
<br />
=== Bayes Classifier ===<br />
<br />
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". The general Bayes classifier defined later in this section does not require these independence assumptions.<br />
<br />
In simple terms, a naive Bayes classifier assumes that, given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for naive Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the vector of feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
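<br />
The equivalence of the two forms of the rule can be checked directly. A short derivation, using only the fact that with two classes the posterior probabilities sum to one:<br />
:<math><br />
\begin{align}<br />
r(x) > \frac{1}{2} &\iff r(x) > 1 - r(x) \\<br />
&\iff P(Y=1|X=x) > P(Y=0|X=x)<br />
\end{align}<br />
</math><br />
since <math>\,P(Y=0|X=x) = 1 - r(x)</math>.<br />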
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
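<br />
A brief sketch of why the theorem holds, with <math>\,h^*</math> taken to be the rule that picks the class with the largest posterior probability; this is an outline only and not a substitute for a full proof. For any classification rule <math>\,h</math> and any fixed <math>\,x</math>,<br />
:<math><br />
\begin{align}<br />
P(h(X) \neq Y|X=x) &= 1 - P(Y=h(x)|X=x) \\<br />
&\ge 1 - \max_{i \in \mathcal{Y}} P(Y=i|X=x) \\<br />
&= P(h^*(X) \neq Y|X=x)<br />
\end{align}<br />
</math><br />
Averaging both sides over the distribution of <math>\,X</math> gives <math>\,L(h) \ge L(h^*)</math>.<br />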
<br />
A rather detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that the components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. <br />
{{Cleanup|date=October 2nd 2010|reason=It should be noted why, if the Bayes classifier is optimal, we need other classifiers such as neural networks. The main reason is that, while this method is optimal and simple, it needs a lot of information to implement. We need the class-conditional distribution, which is hard to obtain and must be estimated. In real applications we usually assume it to be a known distribution based on the problem, but in general it is not easy to obtain the distribution of the process. We also need to know the prior and marginal probabilities; these are simpler to obtain, but they add complexity to the problem. For this reason other classifiers have been developed. They are not optimal, but they can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
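<br />
The density estimation approach (approach 3) can be sketched for a single discrete feature, where the class-conditional probabilities are estimated by relative frequencies. This is a hedged illustration: the function <code>fit_density_classifier</code>, the fallback value 0.5 for feature values never seen in training, and the toy data are all assumptions made here, and the sketch assumes both classes appear in the training set.<br />

```python
from collections import Counter

def fit_density_classifier(inputs, labels):
    """Estimate P(X=x|Y=c) and P(Y=1) by relative frequencies; return r_hat and h."""
    n = len(labels)
    prior1 = sum(labels) / n                      # estimate of P(Y = 1)
    counts = {0: Counter(), 1: Counter()}
    for x, y in zip(inputs, labels):
        counts[y][x] += 1
    class_sizes = {c: sum(counts[c].values()) for c in (0, 1)}

    def likelihood(x, c):
        # Relative frequency of x among training inputs of class c.
        return counts[c][x] / class_sizes[c]

    def r_hat(x):
        num = likelihood(x, 1) * prior1
        den = num + likelihood(x, 0) * (1 - prior1)
        return num / den if den > 0 else 0.5      # unseen x: no evidence either way

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Hypothetical training set with a single discrete feature.
inputs = ["red", "red", "green", "red", "green", "green"]
labels = [1, 1, 1, 0, 0, 0]
r_hat, h = fit_density_classifier(inputs, labels)
```

With continuous features the relative frequencies would be replaced by an estimated density, for example a Gaussian fitted to each class.<br />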
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \underset{i}{\operatorname{arg\,max}} \, P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
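<br />
The multi-class rule in the theorem can be sketched as follows. The code is an illustration under assumptions: the likelihood values <math>\,f_i(x)</math> and priors <math>\,\pi_i</math> are hypothetical numbers for one fixed <math>\,x</math>, and the function names are chosen here.<br />

```python
def posteriors(likelihoods, priors):
    """Bayes formula: P(Y=i|X=x) for every class i, from f_i(x) and pi_i."""
    joint = {i: likelihoods[i] * priors[i] for i in priors}
    evidence = sum(joint.values())           # P(X = x)
    return {i: joint[i] / evidence for i in joint}

def bayes_rule(likelihoods, priors):
    """h*(x) = argmax_i P(Y=i|X=x)."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)

# Hypothetical values of f_i(x) and pi_i for k = 3 classes at one fixed x.
likelihoods = {1: 0.10, 2: 0.30, 3: 0.20}
priors = {1: 0.5, 2: 0.3, 3: 0.2}
post = posteriors(likelihoods, priors)
label = bayes_rule(likelihoods, priors)
```

Because the evidence <math>\,P(X=x)</math> is the same for every class, the same label is obtained by maximizing <math>\,f_i(x)\pi_i</math> directly; the division only matters when the posterior values themselves are needed.<br />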
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his or her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.05 \times 0.5+0.2 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
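<br />
The arithmetic of this example can be checked with a few lines of Python. The likelihood values 0.05 and 0.2 are the ones used in the calculation above (they come from the course tables, which are not reproduced here); the variable names are chosen for this sketch.<br />

```python
# Values used in the worked calculation for the new student x = (0, 1, 0).
likelihood_pass = 0.05   # P(X=(0,1,0) | Y=1)
likelihood_fail = 0.20   # P(X=(0,1,0) | Y=0)
prior_pass = prior_fail = 0.5

# Evidence P(X = x) is the sum of likelihood * prior over both classes.
evidence = likelihood_pass * prior_pass + likelihood_fail * prior_fail
r_hat = likelihood_pass * prior_pass / evidence

# Bayes rule for two classes: predict 1 only if r_hat exceeds 1/2.
prediction = 1 if r_hat > 0.5 else 0
```

Because the two priors are equal, they cancel, and the prediction depends only on which class-conditional probability is larger.<br />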
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable it is that event E will occur, prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work that laid the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the dimension of <math>\,x</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
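<br />
As a minimal sketch of how this boundary could be evaluated numerically, the following Matlab snippet computes ''a'' and ''b'' for a made-up two-class problem and classifies one point. The means, covariance, priors and the names <code>g</code> and <code>x_new</code> are assumptions chosen for illustration; they do not come from the lecture.<br />

```matlab
% Hypothetical two-class example; all numbers below are made up.
mu0   = [0; 0];            % mean of class 0
mu1   = [2; 2];            % mean of class 1
Sigma = [1 0.5; 0.5 2];    % common covariance matrix (LDA assumption)
pi0   = 0.5;               % prior of class 0
pi1   = 0.5;               % prior of class 1

% Coefficients of the boundary a'*x + b = 0 from the derivation above.
a = -2 * (Sigma \ (mu1 - mu0));
b = -2 * log(pi1 / pi0) + mu1' * (Sigma \ mu1) - mu0' * (Sigma \ mu0);

% g(x) equals -2 times the log posterior ratio of class 1 to class 0,
% so g(x) < 0 favours class 1 and g(x) > 0 favours class 0.
g = @(x) a' * x + b;

x_new = [1.8; 1.5];        % illustrative point to classify
if g(x_new) < 0
    label = 1;
else
    label = 0;
end
fprintf('g(x_new) = %g, predicted class = %d\n', g(x_new), label);
```

Because the two priors are equal in this sketch, the midpoint of the two means, <code>[1; 1]</code>, lies on the boundary.<br />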
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA in the two-class case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with this derivation, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. For any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> have the same prior probability (for example, when the priors are estimated from classes with the same number of data points), then the log term vanishes and the resulting decision boundary passes through the point exactly halfway between the means of <math>\,m </math> and <math>\,n</math>.<br />
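<br />
The "halfway" statement can be checked directly. As a short sketch, assume the two priors are equal, so that <math>\,\log(\frac{\pi_m}{\pi_n})=0</math>, and use the symmetry of <math>\,\Sigma^{-1}</math>:<br />
<br />
:<math>\,\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) = (\mu_m+\mu_n-2x)^\top\Sigma^{-1}(\mu_m-\mu_n)</math><br />
<br />
The right-hand side is zero at <math>\,x=\frac{1}{2}(\mu_m+\mu_n)</math>, so under this assumption the boundary passes through the midpoint of the two class means.<br />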
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found at [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic discriminant analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features that best separates the classes. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance matrices of the classes are identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian for every class <math>\,k</math>, the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are all the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
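<br />
The step from the quadratic to the linear form is not spelled out above. A short sketch of it, assuming a common covariance matrix <math>\,\Sigma_k=\Sigma</math>, is to expand the quadratic form:<br />
<br />
:<math>\,- \frac{1}{2}\log(|\Sigma|) - \frac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k) + \log (\pi_k) = - \frac{1}{2}\log(|\Sigma|) - \frac{1}{2}x^\top\Sigma^{-1}x + x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k)</math><br />
<br />
The first two terms on the right-hand side do not depend on the class, so they do not change which class attains the maximum and can be dropped. What remains is the linear discriminant function given above.<br />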
<br />
===In practice===<br />
In practice, the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When all classes are assumed to share a common covariance matrix, it is estimated by the average of the class sample covariances, weighted by class size (here <math>\,K</math> is the number of classes): <br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{K}n_r\hat{\Sigma}_r}{\sum_{l=1}^{K}n_l} </math><br />
<br />
This is the Maximum Likelihood estimate of the common covariance matrix.<br />
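<br />
The estimates above can be sketched in a few lines of base Matlab. The toy data and the variable names (<code>pi_hat</code>, <code>mu_hat</code>, <code>Sigma_hat</code>, <code>Sigma_pool</code>) are assumptions made for illustration only.<br />

```matlab
% Toy labelled data (made up for illustration): rows are observations.
X = [1 2; 2 1; 2 3; 3 2; 6 5; 7 7; 8 6; 7 5; 6 7; 8 8];
y = [1; 1; 1; 1; 2; 2; 2; 2; 2; 2];

classes = unique(y);
K = numel(classes);
[n, d] = size(X);

pi_hat     = zeros(K, 1);      % estimated priors n_k / n
mu_hat     = zeros(K, d);      % estimated class means, one row per class
Sigma_hat  = zeros(d, d, K);   % per-class ML covariance estimates (used by QDA)
Sigma_pool = zeros(d, d);      % common covariance estimate (used by LDA)

for r = 1:K
    Xr  = X(y == classes(r), :);              % observations of class r
    n_r = size(Xr, 1);
    pi_hat(r)    = n_r / n;
    mu_hat(r, :) = mean(Xr, 1);
    Xc = Xr - repmat(mu_hat(r, :), n_r, 1);   % centre the class
    Sigma_hat(:, :, r) = (Xc' * Xc) / n_r;    % divide by n_r (ML), not n_r - 1
    Sigma_pool = Sigma_pool + n_r * Sigma_hat(:, :, r);
end
Sigma_pool = Sigma_pool / n;                  % weighted average of the class covariances
```

Note that Matlab's built-in <code>cov</code> divides by <math>\,n_k-1</math> by default, so it would not reproduce the ML estimate exactly.<br />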
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data in each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles (spheres in higher dimensions).<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each class center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. When the priors are equal, the class whose center is closest maximises <math>\,\delta_k</math>; otherwise the closest class may be overruled by a class with a larger prior. According to the theorem, we then assign the point to the class <math>\,k</math> that maximises <math>\,\delta_k</math>. In addition, <math>\, \Sigma_k = I </math> implies that the data in each class is spherical. <br />
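<br />
A minimal sketch of Case 1 in Matlab is given below. The class means, priors and test point are made up; they are chosen so that the prior adjustment changes the answer that the distance alone would give.<br />

```matlab
mu   = [0 0; 3 3];     % made-up class means, one row per class
pi_k = [0.9; 0.1];     % made-up priors
x    = [1.6 1.6];      % point to classify (slightly closer to class 2)

% Squared Euclidean distance from x to every class mean.
sq_dist = sum((mu - repmat(x, size(mu, 1), 1)).^2, 2);

% Discriminant for Sigma_k = I: the log-determinant term is zero.
delta = -0.5 * sq_dist + log(pi_k);

[~, k_hat]   = max(delta);     % class chosen by the Bayes rule
[~, nearest] = min(sq_dist);   % class with the nearest mean
fprintf('nearest mean: class %d, chosen class: %d\n', nearest, k_hat);
```

Here the nearest mean belongs to class 2, but the much larger prior of class 1 makes class 1 the maximiser of <math>\,\delta_k</math>.<br />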
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, as the covariance matrix <math>\, \Sigma_k </math> is, we can take <math>\, U=V</math>.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
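<br />
The transformation can be checked numerically. The sketch below uses a made-up covariance matrix, class mean and data point, and uses <code>eig</code> in place of the SVD, which gives the same decomposition for a symmetric positive definite matrix.<br />

```matlab
Sigma = [4 1; 1 2];      % hypothetical covariance matrix (symmetric positive definite)
mu_k  = [1; -1];         % hypothetical class mean
x     = [3; 0.5];        % hypothetical data point

[U, S] = eig(Sigma);                  % Sigma = U * S * U' with orthonormal U
W = diag(1 ./ sqrt(diag(S))) * U';    % W = S^(-1/2) * U'

x_star  = W * x;                      % transformed data point
mu_star = W * mu_k;                   % transformed class mean

d_quad = (x - mu_k)' * (Sigma \ (x - mu_k));         % original quadratic form
d_eucl = (x_star - mu_star)' * (x_star - mu_star);   % squared Euclidean distance
fprintf('%g  %g\n', d_quad, d_eucl);
```

The two printed values agree up to rounding, and <code>W * Sigma * W'</code> is the identity matrix, which is what allows the transformed data to be treated as in Case 1.<br />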
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider two classes with different shapes, each of which would need its own transformation to become spherical. Given a data point, we must decide which class it belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA can often fit the data better than LDA because QDA does not make LDA's assumption that the covariance matrix of every class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold in practice. Another method, kernel QDA, does not rely on the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class with the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
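<br />
To make the two formulas concrete, the following sketch evaluates them for <math>\,K=2</math> and a few example dimensions. The chosen dimensions are illustrative; 64 is the dimension of the raw 2_3 data.<br />

```matlab
K = 2;                 % number of classes (assumed)
d = [2 10 64];         % example dimensionalities

n_lda = (K - 1) .* (d + 1);                    % (K-1)(d+1)
n_qda = (K - 1) .* (d .* (d + 3) ./ 2 + 1);    % (K-1)(d(d+3)/2 + 1)

disp([d; n_lda; n_qda]);   % rows: d, LDA parameters, QDA parameters
```

For <math>\,d=64</math> this gives 65 parameters for LDA against 2145 for QDA.<br />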
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that this is not QDA done with LDA: if we apply QDA directly to this problem, the resulting decision boundary will be different. Here we look for a nonlinear boundary in the hope of obtaining a better boundary, but the method differs from general QDA. We can call it nonlinear LDA.}}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of our original features. We then do LDA on our new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space can then be read back in the original dimensions, where it is a quadratic function.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the number of free entries in a symmetric <math>\,d \times d</math> matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>v</math> is a diagonal <math>d \times d</math> matrix with diagonal entries <math>v_1,\dots,v_d</math>, that we cannot estimate directly with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared elementwise, to a cubic dimension with the original matrix cubed elementwise, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
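<br />
As a small sketch of the idea, the snippet below appends squared features to a made-up data matrix and checks that a linear function of the augmented vector equals a quadratic function of the original one. The data and the weights <code>w</code> and <code>v</code> are hypothetical, and rows are taken to be observations.<br />

```matlab
X = [1 2; 3 -1; 0.5 4];     % 3 made-up observations in d = 2 dimensions
X_star = [X, X.^2];         % append the elementwise squares: now 2d columns

w = [0.5; -1];              % hypothetical linear weights
v = [2; 0.25];              % hypothetical weights on the squared terms
w_star = [w; v];            % stacked weight vector w*

% Linear in the augmented features ...
y_aug = X_star * w_star;

% ... equals x' * diag(v) * x + w' * x in the original features.
y_quad = sum((X.^2) .* repmat(v', size(X, 1), 1), 2) + X * w;

disp([y_aug, y_quad]);
```

The same construction with <code>X.^3</code> or <code>sin(X)</code> in place of <code>X.^2</code> gives the other extensions mentioned above.<br />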
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we enlarge <code>X_star</code> to six columns and set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct for only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function <code>princomp</code> in Matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The details of this function can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some emphasized steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (possibly up to the sign of each column).<br />
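<br />
For readers without the Statistics Toolbox, the core steps of <code>princomp</code> can be sketched in base Matlab as follows. The data matrix is made up for illustration, and the cross-check against <code>cov</code> and <code>eig</code> is an addition, not part of the original function. (Recent Matlab releases recommend <code>pca</code> instead of <code>princomp</code>.)<br />

```matlab
% Made-up data: 6 observations (rows) of 3 variables (columns).
X = [2 0 1; 4 1 3; 6 1 2; 8 3 5; 10 2 4; 12 5 7];
[m, n] = size(X);

Xc = X - repmat(mean(X, 1), m, 1);        % centre each column, as princomp does
[U, D, V] = svd(Xc ./ sqrt(m - 1), 0);    % economy-size SVD of the scaled data
score  = Xc * V;                          % data in principal-component coordinates
latent = diag(D) .^ 2;                    % variance along each principal component

% Cross-check: latent should equal the eigenvalues of the sample covariance.
ev = sort(eig(cov(X)), 'descend');
disp([latent, ev]);
```

The columns of <code>V</code> play the role of <code>pc</code> in the listing above, and the covariance matrix of <code>score</code> is diagonal with <code>latent</code> on its diagonal.<br />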
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto a single point on some projection line (one-dimensional; a linear combination of the components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well-defined separation between the two classes, allowing us to classify new data points. In practice, it is generally not possible to collapse all data points in one class to a single point. Instead, we make the projected data points within each class as close to each other as possible while keeping them as far as possible from the other classes.<br />
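<br />
For the two-class, two-dimensional case described above, the standard closed-form Fisher direction is proportional to <math>S_W^{-1}(\mu_1 - \mu_2)</math>, where <math>S_W</math> is the sum of the within-class scatter matrices. The following is only a minimal sketch in plain Python (not the Matlab or R used in class); the function names and the toy data are made up for illustration, and <math>S_W</math> is assumed to be invertible.<br />

```python
import math

def class_mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, mu):
    """Entries (a, b, c) of the scatter matrix [[a, b], [b, c]] around mu."""
    a = sum((p[0] - mu[0]) ** 2 for p in points)
    b = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    c = sum((p[1] - mu[1]) ** 2 for p in points)
    return a, b, c

def fisher_direction(class1, class2):
    """Unit vector proportional to S_W^{-1} (mu1 - mu2) for two 2-D classes."""
    mu1, mu2 = class_mean(class1), class_mean(class2)
    a1, b1, c1 = scatter(class1, mu1)
    a2, b2, c2 = scatter(class2, mu2)
    a, b, c = a1 + a2, b1 + b2, c1 + c2   # within-class scatter S_W
    det = a * c - b * b                   # assumed nonzero (S_W invertible)
    d0, d1 = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # Multiply (mu1 - mu2) by the explicit inverse of the 2x2 matrix S_W.
    w = [(c * d0 - b * d1) / det, (a * d1 - b * d0) / det]
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]

def project(points, w):
    """One-dimensional coordinates of the points along direction w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Toy data: both classes are spread out along x but separated along y.
class1 = [(0, 0), (4, 0), (2, 1), (2, -1)]
class2 = [(1, 3), (5, 3), (3, 4), (3, 2)]
w = fisher_direction(class1, class2)
proj1, proj2 = project(class1, w), project(class2, w)
```

On this toy data the direction points almost entirely along the second coordinate, and the two sets of projected values do not overlap, which is the separation the description above aims for.<br />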
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
>> library(MASS)  # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
: Calculate the singular value decomposition of the centered X (PCA requires the column means to be subtracted first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA).<br />
>> s <- svd(scale(X,center=TRUE,scale=FALSE),nu=1,nv=1)<br />
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
=== Distance Metric Learning VS FDA ===<br />
In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.<br />
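<br />
The equivalence mentioned above can be checked numerically: for a metric matrix <math>A = W^T W</math>, the distance <math>\sqrt{(x-y)^T A (x-y)}</math> equals the Euclidean distance between <math>Wx</math> and <math>Wy</math>. The following is a small sketch in plain Python; the matrix <code>W</code> and the two points are arbitrary values chosen only for illustration.<br />

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def metric_distance(x, y, A):
    """sqrt((x - y)^T A (x - y)) for a positive semidefinite matrix A."""
    d = [xi - yi for xi, yi in zip(x, y)]
    Ad = matvec(A, d)
    return math.sqrt(sum(di * adi for di, adi in zip(d, Ad)))

def euclidean(x, y):
    """Ordinary Euclidean distance."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical linear transformation W and the metric A = W^T W it induces.
W = [[2.0, 0.0], [1.0, 1.0]]
A = [[sum(W[k][i] * W[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

x, y = [1.0, 2.0], [3.0, -1.0]
d_metric = metric_distance(x, y, A)                       # distance under A
d_transformed = euclidean(matvec(W, x), matvec(W, y))     # Euclidean after W
```

Both quantities equal <math>\sqrt{17}</math> here, so learning the metric <math>A</math> and learning the transformation <math>W</math> are two views of the same problem.<br />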
<br />
Some of the proposed algorithms are iterative and computationally expensive. In the paper "[http://www.aaai.org/Papers/AAAI/2008/AAAI08-095.pdf Distance Metric Learning VS FDA]", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs as well as or better than some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.<br />
<br />
=== Two-class Problem ===<br />
<br />
=== Multi-class Problem ===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for dimensionality reduction and feature extraction, and is often used as a preprocessing step for classification. It has applications in data visualization, data mining, and reducing the dimensionality of a data set. It is mostly used for exploratory data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA is designed with two important aspects of data analysis in mind:<br />
* Reducing covariance in the data<br />
* Preserving the information stored in the data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We apply PCA to the data and compare the projections of the labeled images. This is an example of classification.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| \textbf{w}^T \| \| s \| \| \textbf{w} \| = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
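<br />
As a check, the two stationary points can also be obtained directly from the parallel-gradient condition <math>\displaystyle \nabla f = \lambda \nabla g</math>, taking <math>\displaystyle g(x,y) = x^2+y^2-1</math>:<br />
<br />
<math>\begin{align}
(1,\,-1) = \lambda\,(2x,\,2y) \;&\Rightarrow\; y = -x \\
x^2 + y^2 = 1 \;&\Rightarrow\; 2x^2 = 1 \;\Rightarrow\; x = \pm\sqrt{2}/2 \\
f(\sqrt{2}/2,\,-\sqrt{2}/2) = \sqrt{2}, &\qquad f(-\sqrt{2}/2,\,\sqrt{2}/2) = -\sqrt{2}
\end{align}</math><br />
<br />
so the first point is the constrained maximum and the second is the constrained minimum.<br />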
<br />
====Determining '''W''' ====<br />
Back to the original problem, we form the Lagrangian (writing <math>S</math> for the sample covariance matrix <math>s</math>),<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector then the second term is nonzero. Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D-dimensional data will have D eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
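<br />
The results above can be verified on a tiny example. The sketch below, in plain Python rather than the Matlab used in class, computes the sample covariance matrix of four made-up 2-D points, finds its eigenvalues and eigenvectors with the closed-form formula for a symmetric 2 by 2 matrix, and checks that the variance of the projection onto the first eigenvector equals the largest eigenvalue. All function names and the data are illustrative assumptions.<br />

```python
import math

def covariance_2d(points):
    """Sample covariance matrix [[sxx, sxy], [sxy, syy]] of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def eig_sym_2x2(S):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    mid = (a + c) / 2
    rad = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + rad, mid - rad
    if b != 0:
        v1 = [b, lam1 - a]            # solves (S - lam1 I) v = 0
    elif a >= c:
        v1 = [1.0, 0.0]               # S is already diagonal
    else:
        v1 = [0.0, 1.0]
    norm = math.hypot(v1[0], v1[1])
    v1 = [v1[0] / norm, v1[1] / norm]
    v2 = [-v1[1], v1[0]]              # orthogonal to v1
    return (lam1, lam2), (v1, v2)

def projected_variance(points, w):
    """Sample variance of u = w^T x over the data set."""
    u = [w[0] * p[0] + w[1] * p[1] for p in points]
    mean_u = sum(u) / len(u)
    return sum((ui - mean_u) ** 2 for ui in u) / (len(u) - 1)

points = [(1, 1), (2, 3), (3, 2), (4, 4)]
S = covariance_2d(points)
(lam1, lam2), (v1, v2) = eig_sym_2x2(S)
var_u1 = projected_variance(points, v1)   # variance along the first PC
var_u2 = projected_variance(points, v2)   # variance along the second PC
total_variance = S[0][0] + S[1][1]        # Tr(S)
```

For this data the eigenvalues are 3 and 1/3, the first principal direction is along the diagonal, and the two projected variances add up to the trace of the covariance matrix, as the decomposition above states.<br />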
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. do not label the data and compare the output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
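<br />
The encoding and reconstruction steps of PCA can be sketched as follows, assuming the top ''d'' principal directions have already been computed. This is a plain Python illustration rather than the Matlab used in class; here <code>U</code> is stored as a list of ''d'' directions (the columns of the matrix <math>U</math>), and the data and direction are made-up values.<br />

```python
import math

def center(X):
    """Subtract the per-feature mean; X is a list of n points of length D."""
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]
    return [[xi - mi for xi, mi in zip(x, mean)] for x in X], mean

def encode(U, x):
    """y = U^T x: inner product of x with each of the d directions."""
    return [sum(uj * xj for uj, xj in zip(u, x)) for u in U]

def reconstruct(U, y):
    """x_hat = U y: linear combination of the d directions."""
    D = len(U[0])
    return [sum(U[k][j] * y[k] for k in range(len(U))) for j in range(D)]

# Toy data with D = 2, reduced to d = 1 along the diagonal direction.
X = [[1.0, 1.0], [2.0, 3.0], [3.0, 2.0], [4.0, 4.0]]
U = [[1 / math.sqrt(2), 1 / math.sqrt(2)]]   # assumed first principal direction

Xc, mean = center(X)
Y = [encode(U, x) for x in Xc]               # n points in d dimensions
X_hat = [[m + r for m, r in zip(mean, reconstruct(U, y))] for y in Y]
```

Points that already lie on the principal direction, such as (1, 1) and (4, 4), are reconstructed exactly, while (2, 3) and (3, 2) are both mapped to the mean (2.5, 2.5) because their variation lies entirely in the discarded direction.<br />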
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see the details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using the mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using the mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6424stat841f102010-10-02T21:20:00Z<p>Hclam: /* Example in R */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification - 2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views. Some people call supervised learning "classification" and unsupervised learning "clustering", but others call clustering "unsupervised classification", as in http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other sources that can be found with a Google search. I think it would be good to make this clear here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning combined with efficient classification techniques can be very useful for data mining on very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, and drug discovery.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the <math>\,i</math>th training input's feature values <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math> form a ''d''-dimensional vector and the label of the <math>\,i</math>th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to explain why we define and use the empirical error rate instead of the true error rate. The main reason is that in experimental tasks we cannot measure the true error rate, so we estimate it by the empirical error rate, which for a fixed rule h is an unbiased estimate of the true error rate. }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
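<br />
The empirical error rate is simply the fraction of training pairs that the rule gets wrong. A minimal sketch in plain Python is given below; the one-feature rule <code>h</code> (classify by diameter) and the five training pairs are hypothetical values chosen only to illustrate the formula.<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (x_i, y_i) for which h(x_i) != y_i."""
    mistakes = sum(1 for x, y in zip(X, Y) if h(x) != y)
    return mistakes / len(X)

def h(diameter_cm):
    """Hypothetical rule: fruits wider than 8 cm are labeled as oranges."""
    return "orange" if diameter_cm > 8.0 else "apple"

# Made-up training set: diameters and their true labels.
X = [7.0, 7.5, 8.2, 9.0, 9.5]
Y = ["apple", "apple", "apple", "orange", "orange"]

rate = empirical_error_rate(h, X, Y)   # only the 8.2 cm apple is misclassified
```

Here one of the five training inputs is misclassified, so the empirical error rate is 0.2.<br />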
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it produces the least possible probability of misclassification for any given new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math>\,r</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
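Approach 3 can be sketched in a few lines of Python. The sketch below assumes, purely for illustration, a single real-valued feature and Gaussian class-conditional densities; the approach itself does not prescribe any particular density estimator, and the data and function names are invented.<br />

```python
import math


def fit_gaussian(values):
    """Sample mean and maximum-likelihood variance of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_plug_in_classifier(xs, ys):
    """Return (r_hat, h), where h(x) = 1 if r_hat(x) > 1/2 and 0 otherwise."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    prior1 = sum(ys) / len(ys)  # estimate of P(Y = 1)
    m0, v0 = fit_gaussian(x0)   # density estimate for class 0 (Gaussian assumption)
    m1, v1 = fit_gaussian(x1)   # density estimate for class 1 (Gaussian assumption)

    def r_hat(x):
        numerator = gaussian_pdf(x, m1, v1) * prior1
        evidence = numerator + gaussian_pdf(x, m0, v0) * (1 - prior1)
        return numerator / evidence

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h


# Made-up training data: class 0 clusters near 1.5, class 1 near 4.5.
xs = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 1, 1]
r_hat, h = fit_plug_in_classifier(xs, ys)
print(h(1.2), h(4.8))  # prints: 0 1
```

Swapping <code>fit_gaussian</code> and <code>gaussian_pdf</code> for another density estimator would leave the rest of the sketch unchanged.<br />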
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
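The multi-class rule can be sketched as follows. The likelihood values <math>\,f_i(x)</math> and priors <math>\,\pi_i</math> in the sketch are invented for illustration; in practice they would come from a trained model.<br />

```python
def posteriors(likelihoods, priors):
    """Posterior P(Y=i | X=x) for every class i, given f_i(x) and pi_i."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(X = x)
    return {c: joint[c] / evidence for c in joint}


def bayes_rule(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Hypothetical values of f_i(x) at one fixed x, and hypothetical priors.
likelihoods = {1: 0.10, 2: 0.30, 3: 0.05}
priors = {1: 0.5, 2: 0.2, 3: 0.3}

post = posteriors(likelihoods, priors)
label = bayes_rule(likelihoods, priors)
print(label)  # class 2 has the largest posterior (0.06 / 0.125 = 0.48)
```

Because the evidence <math>\,P(X=x)</math> is the same for every class, comparing the products <math>\,f_i(x)\pi_i</math> directly would give the same label.<br />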
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his or her feature values are <math>\, x = \{G, M, H\} </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = \{0, 1, 0\}</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
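The arithmetic of this example can be checked with a short Python sketch. It takes the two likelihood values used in the calculation above (0.05 for ''pass'' and 0.2 for ''fail'') as given; the variable names are chosen here for illustration.<br />

```python
# Priors and likelihoods for the new student with x = (0, 1, 0).
prior = {0: 0.5, 1: 0.5}
likelihood = {0: 0.2, 1: 0.05}  # P(X=(0,1,0) | Y=y), as used in the calculation above

joint = {y: likelihood[y] * prior[y] for y in prior}
r_hat = joint[1] / (joint[0] + joint[1])  # posterior probability of passing
label = 1 if r_hat > 0.5 else 0

print(round(r_hat, 4), label)  # prints: 0.2 0
```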
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable it is that event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, and this is because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating [http://en.wikipedia.org/wiki/Hyperplane hyperplane] that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the number of features. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
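As a rough numerical check of this boundary, the following Python sketch computes ''a'' and ''b'' for two-dimensional data. The means, covariance matrix and priors are invented for illustration, and the small matrix helpers are written out by hand so that the sketch only needs the standard library.<br />

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def lda_boundary(mu0, mu1, sigma, pi0, pi1):
    """Return (a, b) such that the decision boundary is a^T x + b = 0."""
    sinv = inv2(sigma)
    diff = [m1 - m0 for m1, m0 in zip(mu1, mu0)]
    a = [-2 * v for v in matvec(sinv, diff)]
    b = (-2 * math.log(pi1 / pi0)
         + dot(mu1, matvec(sinv, mu1))
         - dot(mu0, matvec(sinv, mu0)))
    return a, b


# Hypothetical class means, shared covariance and equal priors.
mu0, mu1 = [0.0, 0.0], [2.0, 2.0]
sigma = [[1.0, 0.0], [0.0, 1.0]]
a, b = lda_boundary(mu0, mu1, sigma, 0.5, 0.5)

midpoint = [1.0, 1.0]
value_at_midpoint = dot(a, midpoint) + b
print(a, b, value_at_midpoint)
```

With equal priors the value at the midpoint of the two means is zero, so the midpoint lies on the boundary.<br />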
<br />
Not surprisingly, the Bayes classifier's decision boundary being linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. For any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> both have the same number of data points (so that their estimated priors are equal), then, in this special case, the resulting decision boundary passes through the point exactly halfway between the centers (means) of <math>\,m </math> and <math>\,n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumption under LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \ \forall k </math> the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
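The two discriminant functions in the theorem can be sketched in Python as follows. The sketch is restricted to two features so that the determinant and inverse can be written by hand, and the class parameters are invented for illustration.<br />

```python
import math


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    det = det2(m)
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def delta_qda(x, mu, sigma, prior):
    """Quadratic discriminant: -1/2 log|S_k| - 1/2 (x-mu_k)^T S_k^-1 (x-mu_k) + log pi_k."""
    diff = [a - b for a, b in zip(x, mu)]
    return (-0.5 * math.log(det2(sigma))
            - 0.5 * dot(diff, matvec(inv2(sigma), diff))
            + math.log(prior))


def delta_lda(x, mu, sigma, prior):
    """Linear discriminant: x^T S^-1 mu_k - 1/2 mu_k^T S^-1 mu_k + log pi_k."""
    sinv_mu = matvec(inv2(sigma), mu)
    return dot(x, sinv_mu) - 0.5 * dot(mu, sinv_mu) + math.log(prior)


def classify(x, classes, delta):
    """Return the class index k that maximises delta_k(x)."""
    return max(classes, key=lambda k: delta(x, *classes[k]))


# Hypothetical parameters: (mean, covariance, prior) per class, shared covariance.
shared = [[2.0, 0.5], [0.5, 1.0]]
classes = {1: ([0.0, 0.0], shared, 0.5), 2: ([3.0, 3.0], shared, 0.5)}

print(classify([0.5, 0.2], classes, delta_qda),
      classify([2.8, 3.1], classes, delta_lda))  # prints: 1 2
```

When all classes share one covariance matrix, the quadratic and linear discriminants differ only by a term that does not depend on <math>\,k</math>, so both give the same label.<br />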
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where the classes share a common covariance matrix (LDA), its maximum likelihood estimate is the average of the class sample covariances, weighted by the class sizes:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{l=1}^{k}n_l} </math><br />
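These estimates can be sketched in plain Python. The data set below is invented for illustration, and the function names are not part of the course material.<br />

```python
def estimate_parameters(xs, ys):
    """Per-class sample size, prior, mean and ML covariance (divides by n_k)."""
    n, d = len(xs), len(xs[0])
    params = {}
    for k in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(members)
        mu = [sum(x[j] for x in members) / n_k for j in range(d)]
        sigma = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in members) / n_k
                  for b in range(d)] for a in range(d)]
        params[k] = {"n": n_k, "prior": n_k / n, "mu": mu, "sigma": sigma}
    return params


def pooled_covariance(params):
    """Common covariance: class covariances averaged with weights n_k."""
    total = sum(p["n"] for p in params.values())
    d = len(next(iter(params.values()))["mu"])
    return [[sum(p["n"] * p["sigma"][a][b] for p in params.values()) / total
             for b in range(d)] for a in range(d)]


# Made-up two-dimensional training set: four points in class 0, two in class 1.
xs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [5.0, 5.0], [7.0, 5.0]]
ys = [0, 0, 0, 0, 1, 1]

params = estimate_parameters(xs, ys)
pooled = pooled_covariance(params)
print(params[0]["mu"], params[1]["mu"], pooled)
```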
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class with the smallest value of half the squared distance minus <math>\,\log(\pi_k)</math> maximises <math>\,\delta_k</math>, and according to the theorem we classify the point into that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
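A minimal Python sketch of Case 1 is given below. The two centers and their priors are invented for illustration; the sketch shows that an unequal prior can move a point into the class whose center is farther away.<br />

```python
import math


def delta_identity(x, mu, prior):
    """delta_k(x) when Sigma_k = I: -1/2 * squared distance + log prior."""
    squared_distance = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * squared_distance + math.log(prior)


def classify(x, centers):
    """centers maps a class name to (mean, prior)."""
    return max(centers, key=lambda k: delta_identity(x, *centers[k]))


x = [1.1, 0.0]  # slightly closer to the center of class "B"

equal_priors = {"A": ([0.0, 0.0], 0.5), "B": ([2.0, 0.0], 0.5)}
unequal_priors = {"A": ([0.0, 0.0], 0.9), "B": ([2.0, 0.0], 0.1)}

print(classify(x, equal_priors), classify(x, unequal_priors))  # prints: B A
```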
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
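The transformation can be checked numerically with a short Python sketch. To stay within the standard library, the sketch assumes that the decomposition <math>\,\Sigma = USU^\top</math> is already known: it builds <math>\,\Sigma</math> from a chosen rotation <math>\,U</math> and diagonal <math>\,S</math> instead of computing an eigendecomposition. All numerical values are invented.<br />

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    return [sum(p * q for p, q in zip(row, v)) for row in m]


def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# Assumed decomposition: U is a rotation (orthonormal), S holds the eigenvalues.
theta = math.pi / 6
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
S = [[4.0, 0.0], [0.0, 1.0]]
Sigma = matmul(matmul(U, S), transpose(U))


def whiten(x):
    """x* = S^(-1/2) U^T x"""
    ut_x = matvec(transpose(U), x)
    return [ut_x[i] / math.sqrt(S[i][i]) for i in range(len(x))]


x, mu = [3.0, 1.0], [1.0, 2.0]
diff = [a - b for a, b in zip(x, mu)]

# Left-hand side: (x - mu)^T Sigma^-1 (x - mu)
lhs = sum(d * w for d, w in zip(diff, matvec(inv2(Sigma), diff)))
# Right-hand side: squared Euclidean distance after the transformation
rhs = sum((a - b) ** 2 for a, b in zip(whiten(x), whiten(mu)))

print(abs(lhs - rhs) < 1e-9)  # prints: True
```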
<br />
Note that when we have multiple classes, they must all have the same transformation; otherwise, we would have to assume ahead of time that a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider two classes with different shapes, and consider transforming them to the same shape. Given a data point, we must decide which class this point belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not be the case in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
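<br />
The two counts above can be tabulated with a short script. This is a minimal Python sketch (the function names are chosen here for illustration and are not part of the course material); it only evaluates the formulas <math>\,(K-1)(d+1)</math> and <math>(K-1)(\frac{d(d+3)}{2}+1)</math> stated above.<br />

```python
def lda_param_count(num_classes, dim):
    """Parameters for K-1 linear boundaries a^T x + b in d dimensions."""
    return (num_classes - 1) * (dim + 1)


def qda_param_count(num_classes, dim):
    """Parameters for K-1 quadratic boundaries x^T a x + b^T x + c.

    A symmetric d-by-d matrix a has d(d+1)/2 free entries, b has d, c has 1.
    """
    per_boundary = dim * (dim + 1) // 2 + dim + 1
    return (num_classes - 1) * per_boundary


# Two classes (as in the 2_3 data) for a few dimensionalities.
for d in (2, 10, 64):
    print(d, lda_param_count(2, d), qda_param_count(2, d))
```

For <math>\,K=2</math> and <math>\,d=64</math> (the raw 2_3 data before PCA) this gives 65 parameters for LDA against 2145 for QDA, which illustrates the gap shown in the plot.<br />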
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions of our original features, such as their squares. We then do LDA on the new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space corresponds to a quadratic boundary when expressed in the original features.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> distinct entries, the extra <math>\,\frac{d(d+1)}{2}</math> estimates make QDA less robust when we have few data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
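where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate with a linear method. For the construction below, take <math>v</math> to be a diagonal matrix with diagonal entries <math>v_1,\ldots,v_d</math>, so that <math>x^Tvx = \sum_i v_i x_i^2</math>.<br />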
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix containing the elementwise squares of the original entries, with cubic features by appending the elementwise cubes, or even with a different function altogether, such as <math>\,\sin(x)</math>.<br />
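<br />
The construction of <math>\,x^*</math> and <math>\,\underline{w}^*</math> can be illustrated in a few lines. The following is a minimal Python sketch with made-up weights (the numbers and function names are assumptions for illustration only); it shows that a linear function of the expanded vector equals a quadratic function of the original vector.<br />

```python
def expand_quadratic(x):
    """Map [x_1, ..., x_d] to [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]


def linear_score(w_star, x_star):
    """Evaluate the linear function w*^T x* in the expanded space."""
    return sum(wi * xi for wi, xi in zip(w_star, x_star))


# Hypothetical weights: w = (1, -2) for the linear part, v = (3, 0.5) for the squares.
w_star = [1.0, -2.0, 3.0, 0.5]
x = [2.0, 4.0]

score = linear_score(w_star, expand_quadratic(x))
# Same value computed directly as a quadratic in the original coordinates.
direct = 1.0 * x[0] - 2.0 * x[1] + 3.0 * x[0] ** 2 + 0.5 * x[1] ** 2
print(score, direct)
```

Because only the squares <math>{x_i}^2</math> are appended, this expansion cannot represent cross terms such as <math>x_1 x_2</math>; those would need their own extra features.<br />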
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
 >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of a handwritten 2 and the last 200 elements are images of a handwritten 3. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
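<br />
The same comparison can be written out explicitly. Below is a minimal Python sketch (the helper name and the toy labels are made up for illustration; they are not the 2_3 data) that counts mismatches between predicted and true labels and also reproduces the arithmetic for 369 correct out of 400.<br />

```python
def empirical_error_rate(predicted, actual):
    """Fraction of points whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Toy labels (made up for illustration): 1 mistake in 5 points.
toy_rate = empirical_error_rate([1, 1, 2, 2, 2], [1, 1, 1, 2, 2])
print(toy_rate)

# The figure quoted above: 369 correct out of 400.
quoted_rate = 1 - 369 / 400
print(round(quoted_rate, 4))
```

Note that this is the error on the training data itself, so it may be optimistic compared with the error on unseen data.<br />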
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA correctly classifies only 2 more data points than LDA (371 versus 369); we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the <code>princomp</code> function in Matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in<br />
 % SCORE, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using the SVD method and <code>princomp</code> respectively, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
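 >> [U score]=princomp(X');<br />
<br />
We then find that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (possibly up to the sign of individual columns, since singular vectors are only determined up to sign).<br />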
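<br />
One way to see why <code>princomp</code> uses <math>\,V</math> is the following short derivation (a sketch, writing <math>X_c</math> for the column-centered <math>m \times n</math> data matrix with observations in rows; this symbol is introduced here only for illustration). If <math>\,X_c=UdV'</math> is the economy-size decomposition, so that <math>\,U'U=I</math>, then<br />
<br />
<math>\frac{1}{m-1} X_c' X_c = \frac{1}{m-1} V d' U' U d V' = V \left( \frac{d'd}{m-1} \right) V'</math><br />
<br />
so the columns of <math>\,V</math> are eigenvectors of the sample covariance matrix of the variables, with eigenvalues <math>\,d_i^2/(m-1)</math>. This appears consistent with the code above, which divides the centered data by <math>\sqrt{m-1}</math> before the SVD and then squares the singular values to obtain <code>latent</code>. The scores are then <math>\,X_c V = U d</math>.<br />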
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one-dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well-defined separation between the two classes, allowing us to classify the data. In practice, it is generally not possible to collapse all data points in one class to a single point. Instead, we make the data points within each class as close to each other as possible while keeping them as far as possible from the other classes.<br />
<br />
===Example in R===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
 >> library(MASS)  # provides mvrnorm and lda<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
<br />
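: Calculate the singular value decomposition of the column-centered X (PCA requires the column means to be subtracted first). The most significant direction is in <code>s$v[,1]</code>, and is displayed as the black line in the diagram (this is PCA)<br />
 >> s <- svd(scale(X, center=TRUE, scale=FALSE), nu=1, nv=1)<br />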
<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
>> s2 <- lda(X,grouping=Y)<br />
<br />
: Now that we've calculated the PCA and FLDA decompositions, we can create a plot to demonstrate the differences between the two algorithms. <br />
<br />
: Plot the set of points, according to colours given in Y.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
<br />
: Plot the FLDA direction, again through the mean.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
<br />
FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for dimensionality reduction and feature extraction. It has applications in data visualization, data mining, and reducing the dimensionality of a data set. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
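<br />
How much variance the first ''k'' components preserve can be quantified once the variance along each component is known. The following is a minimal Python sketch, assuming the variance along each principal component equals the corresponding eigenvalue of the sample covariance matrix; the eigenvalues and function names below are hypothetical and chosen only for illustration.<br />

```python
def variance_retained(eigenvalues, k):
    """Fraction of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


def smallest_k(eigenvalues, target):
    """Smallest k whose leading components retain at least `target` of the variance."""
    for k in range(1, len(eigenvalues) + 1):
        if variance_retained(eigenvalues, k) >= target:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues of a 4-dimensional sample covariance matrix.
eigs = [5.0, 3.0, 1.5, 0.5]
print(variance_retained(eigs, 2))  # (5 + 3) / 10
print(smallest_k(eigs, 0.9))
```

A threshold such as 90% is only a rule of thumb; the appropriate number of components depends on the data and on the downstream task.<br />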
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
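<br />
As a rough count of the saving (simple arithmetic on the sizes quoted above, assuming we also store the mean image <math>\bar{x}</math>):<br />
<br />
<math>\underbrace{644 \times 103}_{\text{original}} = 66332 \qquad \text{versus} \qquad \underbrace{644 \times 2}_{V} + \underbrace{2 \times 103}_{W} + \underbrace{644}_{\bar{x}} = 1288 + 206 + 644 = 2138</math><br />
<br />
so the two-component approximation stores roughly 3% of the original numbers.<br />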
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>c > 0</math>, so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, so that<br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> obtained with the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
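<br />
The identity <math>\textrm{var}(u) = \textbf{w}^T s \textbf{w}</math> used in this derivation can be checked numerically. Below is a minimal Python sketch on a made-up two-dimensional data set (the data, the direction <math>\textbf{w}</math> and the helper names are assumptions for illustration): it computes the sample variance of the projections directly and compares it with the quadratic form.<br />

```python
def mean(values):
    return sum(values) / len(values)


def sample_cov(a, b):
    """Sample covariance with the usual 1/(n-1) normalisation."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


# Toy 2-dimensional data set (made up for illustration).
x1 = [2.0, 0.0, 1.0, 5.0, 2.0]
x2 = [1.0, 1.0, 3.0, 4.0, 1.0]

# A unit-length direction w = (0.6, 0.8), so w^T w = 1.
w = (0.6, 0.8)

# Variance of the projections u_i = w^T x_i.
u = [w[0] * a + w[1] * b for a, b in zip(x1, x2)]
var_u = sample_cov(u, u)

# Quadratic form w^T s w built from the sample covariance matrix s.
s11, s12, s22 = sample_cov(x1, x1), sample_cov(x1, x2), sample_cov(x2, x2)
quad_form = w[0] ** 2 * s11 + 2 * w[0] * w[1] * s12 + w[1] ** 2 * s22

print(var_u, quad_form)
```

The two printed numbers agree up to floating-point rounding for any choice of <math>\textbf{w}</math>, not only for unit vectors; the unit-length constraint is what keeps the maximization problem bounded.<br />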
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which gives the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
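<br />
As a sanity check on this example, the constrained maximum can also be located by brute force. The following Python sketch (not the Lagrange method itself, just a numerical scan written for illustration) parametrises the unit circle by an angle and compares the best point found with the stationary point <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />

```python
import math


def f(x, y):
    return x - y


# Parametrise the constraint x^2 + y^2 = 1 by the angle t and scan it.
steps = 100000
best_t = max((2 * math.pi * i / steps for i in range(steps)),
             key=lambda t: f(math.cos(t), math.sin(t)))
best_point = (math.cos(best_t), math.sin(best_t))

# Stationary point obtained with the Lagrange multiplier method.
claimed = (math.sqrt(2) / 2, -math.sqrt(2) / 2)

print(best_point, f(*best_point))
print(claimed, f(*claimed))
```

The scan only confirms the answer for this particular two-dimensional problem; in higher dimensions a scan is impractical, which is why the analytic method is used for PCA.<br />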
<br />
====Determining '''W''' ====<br />
Returning to the original problem, and writing <math>S</math> for the sample covariance matrix (denoted <math>s</math> above), the Lagrangian is<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>.<br />
<br />
Setting the partial derivative of <math>\displaystyle L</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so the maximization takes place over unit vectors only.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components take up successively smaller parts of the total variability.<br />
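<br />
The eigenvector with the largest eigenvalue can be approximated without a full decomposition. The sketch below uses power iteration, which is a different technique from the SVD used elsewhere on this page and is shown only as an illustration; the matrix <code>S</code> is a hypothetical 2 by 2 covariance matrix, and the method assumes the starting vector is not orthogonal to the leading eigenvector.<br />

```python
import math


def mat_vec(S, w):
    return [sum(S[i][j] * w[j] for j in range(len(w))) for i in range(len(S))]


def normalize(w):
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def first_principal_component(S, iterations=200):
    """Power iteration: repeatedly apply S and renormalise to unit length."""
    w = normalize([1.0] + [0.0] * (len(S) - 1))
    for _ in range(iterations):
        w = normalize(mat_vec(S, w))
    # With w^T w = 1, the attained variance w^T S w is the eigenvalue estimate.
    lam = sum(wi * swi for wi, swi in zip(w, mat_vec(S, w)))
    return w, lam


# Hypothetical 2x2 sample covariance matrix with eigenvalues 3 and 1.
S = [[2.0, 1.0], [1.0, 2.0]]
w, lam = first_principal_component(S)
print(w, lam)
```

For this matrix the iteration converges to the direction <math>(1,1)/\sqrt{2}</math> with variance 3, in agreement with <math>\textbf{w}^{*T} S\textbf{w}^* = \lambda^*</math>.<br />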
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, whose eigenvalues can be ordered as<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
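<br />
As a small worked check of this decomposition, take a hypothetical 2 by 2 sample covariance matrix (chosen here only for illustration):<br />
<br />
<math>S = \left( \begin{array}{cc} 2 & 1 \\ 1 & 2 \end{array} \right), \qquad \lambda_1 = 3, \quad \lambda_2 = 1</math><br />
<br />
with eigenvectors <math>(1,1)^T/\sqrt{2}</math> and <math>(1,-1)^T/\sqrt{2}</math>. Then<br />
<br />
<math>\lambda_1 + \lambda_2 = 3 + 1 = 4 = Tr(S) = 2 + 2 = Var(x_1) + Var(x_2)</math><br />
<br />
so the first principal component alone accounts for <math>3/4</math> of the total variance in this example.<br />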
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                  % load the noisy image data; the code below uses the matrix X
 who                                         % list the variables that were loaded
 size(X)                                     % each column of X is one image (20*28 = 560 pixels)
 figure
 imagesc(reshape(X(:,1),20,28)')             % display the first noisy image
 colormap gray
 [u s v] = svd(X);                           % singular value decomposition of the data matrix
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';   % reconstruct using the top ten principal components
 figure
 imagesc(reshape(xHat(:,1000),20,28)')       % here '1000' can be changed to different values, e.g. 105, 500, etc.
 colormap gray
<br />
Running the above code produces two images. The first shows the noisy data, in which we can barely make out the face.<br />
<br />
The second is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. group the data without labels and compare the resulting groups).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, working in the full dimension is hard, and it is easier to work with a reduced set of features extracted from the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
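<br />
The centering, encoding and reconstruction steps of PCA can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in class: the function names are hypothetical, only the first principal component (''d'' = 1) is extracted, and power iteration is used in place of the SVD so that no numerical library is required.<br />

```python
import math

def pca_top_component(points, iters=200):
    """Return (mean, u), where u is the unit leading eigenvector of the covariance."""
    n = len(points)
    d = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Covariance matrix S = (1/n) * sum_i x_i x_i^T of the centered data
    S = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
         for a in range(d)]
    # Power iteration: repeatedly apply S and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector
    # and that the data are not all identical (otherwise the norm is zero).
    u = [1.0] * d
    for _ in range(iters):
        v = [sum(S[a][b] * u[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        u = [x / norm for x in v]
    return mean, u

def encode(point, mean, u):
    """Project a centered point onto u (the inner product u^T x)."""
    return sum((point[j] - mean[j]) * u[j] for j in range(len(u)))

def reconstruct(y, mean, u):
    """Map the one-dimensional code y back to the original space."""
    return [mean[j] + y * u[j] for j in range(len(u))]

# Hypothetical 2-D data lying close to a line
points = [(2.0, 1.9), (0.5, 0.7), (3.1, 3.0), (1.0, 1.2)]
mean, u = pca_top_component(points)
codes = [encode(p, mean, u) for p in points]        # one number per point
approx = [reconstruct(y, mean, u) for y in codes]   # back in two dimensions
```

For data that lie exactly on a line, the reconstruction from a single component recovers every point; for noisy data, the discarded direction is the one with the smaller variance.<br />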
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y_{d \times n} = U^T X_{D \times n} </math>, where the columns of <math>\, U_{D \times d}</math> are the top ''d'' eigenvectors. 
::#When we reconstruct the training set, we are only using the top ''d'' dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X}_{D \times n} = UY_{d \times n} </math>. 
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6423stat841f102010-10-02T21:09:56Z<p>Hclam: /* Fisher's Discriminant Analysis */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification - 2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views on terminology. Some people call supervised learning "classification" and unsupervised learning "clustering", while others call clustering "unsupervised classification", as in http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other sources that can be found with a Google search. I think it would be good to make this clear here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginning of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples, who needed to recognize which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.or/wiki/Regression_analysis clustering], and [http://en.wikipedia.org/wiki/Regression_analysis dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> Y </math> from another random variable <math> X </math>, where <math> Y </math> represents the label assigned to a new data input and <math> X </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i^{th}</math> training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector, and the label of the <math>\, i^{th}</math> training input, <math>\,Y_{i} \in \mathcal{Y} </math>, takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to note why we define and use the empirical error rate instead of the true error rate. The main reason is that in experimental tasks we cannot measure the true error rate, so we estimate it by the empirical error rate, which is an unbiased estimate of the true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I(\cdot)</math> is the indicator function, i.e., <math>\,I(h(X_i) \neq Y_i) = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,i^{th}</math> training input, respectively.<br />
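<br />
The empirical error rate is simply a count of mistakes divided by <math>\,n</math>. A minimal Python sketch follows; the function name, the threshold rule and the data are all hypothetical and serve only to illustrate the formula.<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training pairs (X_i, Y_i) that the rule h misclassifies."""
    if len(X) != len(Y) or not X:
        raise ValueError("X and Y must be non-empty and of equal length")
    mistakes = sum(1 for x, y in zip(X, Y) if h(x) != y)
    return mistakes / len(X)

def h(diameter):
    """Hypothetical rule: label 1 if the diameter exceeds 7, otherwise 0."""
    return 1 if diameter > 7 else 0

X = [6.0, 6.5, 7.5, 8.0, 9.0]   # made-up feature values
Y = [0, 1, 1, 1, 0]             # made-up true labels
rate = empirical_error_rate(h, X, Y)   # 2 of the 5 inputs are misclassified
```

Here the second and fifth inputs are misclassified, so the empirical error rate is 2/5.<br />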
<br />
=== Bayes Classifier ===<br />
<br />
A [http://en.wikipedia.org/wiki/Naive_Bayes_classifier naive] Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". The naive variant is described first; the general Bayes classifier, which does not require the independence assumption, is described afterwards.<br />
<br />
In simple terms, a naive Bayes classifier assumes that, within a class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,x \in \mathcal{X}</math>, the Bayes classifier assigns the label <math>\,y \in \mathcal{Y}</math> for which the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it produces the least possible probability of misclassification for any given new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It should be noted why, if the Bayes classifier is optimal, we still need other classifiers such as neural networks. The main reason is that, while this method is optimal and simple, it needs a lot of information to implement. We need the class-conditional distribution, which is hard to obtain and must be estimated. In real applications we usually assume it to be a known distribution based on the problem, but in general it is not easy to obtain the distribution of the process. We also need to know the prior and marginal probabilities, which are simpler to obtain but add complexity to the problem. For this reason other classifiers have been developed. Although they are not optimal, they can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
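<br />
The density estimation approach can be sketched as follows in Python. This is a minimal illustration, not a prescribed implementation: it assumes one-dimensional features and Gaussian class-conditional densities (neither assumption comes from the text above), and the function names are hypothetical.<br />

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a univariate normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_density_classifier(X, Y):
    """Estimate P(X=x|Y=0), P(X=x|Y=1) and P(Y=1); return (r_hat, h)."""
    params = {}
    for label in (0, 1):
        xs = [x for x, y in zip(X, Y) if y == label]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)  # assumes var > 0
        params[label] = (mu, var)
    prior1 = sum(Y) / len(Y)  # estimate of P(Y=1) as the mean of the Y_i

    def r_hat(x):
        # Estimated posterior P(Y=1|X=x) via the Bayes formula
        num = gaussian_pdf(x, *params[1]) * prior1
        den = num + gaussian_pdf(x, *params[0]) * (1 - prior1)
        return num / den

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Made-up training data: class 0 clusters near 2, class 1 near 8
X = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
Y = [0, 0, 0, 1, 1, 1]
r_hat, h = fit_density_classifier(X, Y)
labels = [h(x) for x in (2.0, 8.0)]
```

With these symmetric data the estimated posterior equals 1/2 at the point halfway between the two class means, which is where the estimated decision boundary lies.<br />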
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
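<br />
As a minimal sketch of this rule in Python, the posterior of each class can be computed from the likelihood values <math>\,f_i(x)</math> and the priors <math>\,\pi_i</math>, and the class with the largest posterior is returned. The function names and the numbers are hypothetical and are used for illustration only.<br />

```python
def bayes_posteriors(likelihoods, priors):
    """Posterior P(Y=i|X=x) for every class i, given f_i(x) and pi_i."""
    joint = {i: likelihoods[i] * priors[i] for i in priors}
    evidence = sum(joint.values())          # P(X=x), the denominator
    return {i: joint[i] / evidence for i in joint}

def bayes_classify(likelihoods, priors):
    """Return the class with the largest posterior probability."""
    post = bayes_posteriors(likelihoods, priors)
    return max(post, key=post.get)

# Made-up values of f_i(x) at one fixed x, and made-up priors, for k = 3
likelihoods = {1: 0.10, 2: 0.30, 3: 0.05}
priors = {1: 0.5, 2: 0.2, 3: 0.3}
post = bayes_posteriors(likelihoods, priors)
label = bayes_classify(likelihoods, priors)
```

In this made-up case class 1 has the largest prior, but class 2 has the largest product <math>\,f_i(x)\pi_i</math> and therefore the largest posterior, so the input is assigned to class 2.<br />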
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his or her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
{{Cleanup|date=October 2010|reason=you are right, I changed it, too.}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
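<br />
The arithmetic can be checked with a few lines of Python. The sketch below takes the two likelihood values used in the calculation above (0.05 for ''pass'' and 0.2 for ''fail'') and the equal priors as given; the function name is hypothetical.<br />

```python
def two_class_posterior(lik1, lik0, prior1):
    """r(x) = P(Y=1|X=x) from the two likelihoods and the prior P(Y=1)."""
    num = lik1 * prior1
    return num / (num + lik0 * (1 - prior1))

# Values taken from the calculation above for x = (0, 1, 0)
r = two_class_posterior(lik1=0.05, lik0=0.2, prior1=0.5)
prediction = 1 if r > 0.5 else 0   # 1 = pass, 0 = fail
```

The posterior comes out as 0.025 / 0.125 = 0.2, which is below 1/2, so the predicted class is 0 (''fail'').<br />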
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how strongly one believes that event E will occur before knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out a large, if not infinite, number of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
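<br />
The long-run relative frequency can be illustrated with a small simulation in Python. This is only a sketch: the function name, the fixed seed and the choice of an event with intrinsic probability 0.5 are assumptions made for the illustration.<br />

```python
import random

def relative_frequency(p, n_trials, seed=0):
    """Run n_trials independent trials of an event with probability p
    and return n_x / n_t, the fraction of trials in which it occurred."""
    rng = random.Random(seed)   # fixed seed so that runs are repeatable
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p)
    return n_x / n_trials

# The approximation n_x / n_t tends to settle near p as n_t grows
for n_t in (10, 1000, 100000):
    print(n_t, relative_frequency(0.5, n_t))
```

With few trials the relative frequency can be far from the intrinsic probability; with many trials it is typically close to it, which is the sense of the limit above.<br />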
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] in the linear case) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\, \mu_y</math> is the mean of class <math>\, y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
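<br />
The boundary coefficients can be computed directly from the class parameters. The following is a minimal Python sketch, not part of the lecture; the function names <code>lda_boundary</code> and <code>lda_predict</code> are hypothetical, and the inverse covariance matrix is assumed to be supplied by the caller as a list of rows.<br />
<br />
```python
import math


def lda_boundary(mu1, mu0, sigma_inv, pi1, pi0):
    """Return (a, b) so that the decision boundary is a^T x + b = 0.

    Hypothetical helper: sigma_inv is the inverse of the shared
    covariance matrix, given as a list of rows.
    """
    d = len(mu1)
    diff = [mu1[i] - mu0[i] for i in range(d)]
    # a = -2 * Sigma^{-1} (mu1 - mu0)
    a = [-2.0 * sum(sigma_inv[i][j] * diff[j] for j in range(d)) for i in range(d)]

    def quad(mu):
        # Quadratic form mu^T Sigma^{-1} mu
        return sum(mu[i] * sigma_inv[i][j] * mu[j] for i in range(d) for j in range(d))

    # b = -2 log(pi1/pi0) + mu1^T Sigma^{-1} mu1 - mu0^T Sigma^{-1} mu0
    b = -2.0 * math.log(pi1 / pi0) + quad(mu1) - quad(mu0)
    return a, b


def lda_predict(x, a, b):
    """Return 1 when a^T x + b < 0 (class 1 has the larger posterior), else 0."""
    value = sum(ai * xi for ai, xi in zip(a, x)) + b
    return 1 if value < 0 else 0
```
<br />
The sign convention in <code>lda_predict</code> follows from the derivation: the left-hand side of the boundary equation is <math>\,-2</math> times the log posterior ratio of class 1 to class 0, so negative values favour class 1.<br />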
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. One obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. For any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> have equal priors (for example, because they contain the same number of training points), then the log term vanishes and the resulting decision boundary passes through the point exactly halfway between the class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
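<br />
The halfway property can be checked directly. The following short verification assumes equal priors, <math>\,\pi_m=\pi_n</math>, so that the log term is zero and the boundary reads<br />
<br />
:<math>\,x^\top\Sigma^{-1}(\mu_m-\mu_n)=\frac{1}{2}\left(\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n\right)</math><br />
<br />
Substituting the midpoint <math>\,x=\frac{1}{2}(\mu_m+\mu_n)</math> into the left-hand side and using the symmetry of <math>\,\Sigma^{-1}</math> (so that <math>\,\mu_m^\top\Sigma^{-1}\mu_n=\mu_n^\top\Sigma^{-1}\mu_m</math>) gives<br />
<br />
:<math>\,\frac{1}{2}(\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n)=\frac{1}{2}\left(\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n\right)</math><br />
<br />
which equals the right-hand side, so the midpoint of the two class means lies on the boundary.<br />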
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we obtain optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math> the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If the class-conditional density <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian for every class <math>\,k</math>, the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the Gaussians all share the same covariance matrix <math>\,\Sigma</math>, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
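<br />
The step from the quadratic form to the linear form can be spelled out as follows. Setting <math>\,\Sigma_k=\Sigma</math> in the quadratic discriminant and expanding the quadratic term gives<br />
<br />
:<math>\,\delta_k(x) = -\frac{1}{2}\log(|\Sigma|) - \frac{1}{2}x^\top\Sigma^{-1}x + x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log(\pi_k)</math><br />
<br />
The first two terms are the same for every class, so dropping them does not change which class attains the maximum. What remains is the linear discriminant.<br />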
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
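<br />
As an illustration of the theorem, here is a minimal Python sketch of the quadratic discriminant rule for two-dimensional data. It is not from the lecture; the names <code>delta_quadratic</code> and <code>bayes_rule</code> are hypothetical, and the 2 by 2 determinant and inverse are written out by hand to keep the sketch free of external libraries.<br />
<br />
```python
import math


def det2(s):
    # Determinant of a 2x2 matrix given as a list of rows
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]


def inv2(s):
    # Inverse of a 2x2 matrix (assumes a nonzero determinant)
    d = det2(s)
    return [[s[1][1] / d, -s[0][1] / d], [-s[1][0] / d, s[0][0] / d]]


def delta_quadratic(x, mu, sigma, prior):
    """delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k."""
    diff = [x[0] - mu[0], x[1] - mu[1]]
    si = inv2(sigma)
    maha = sum(diff[i] * si[i][j] * diff[j] for i in range(2) for j in range(2))
    return -0.5 * math.log(det2(sigma)) - 0.5 * maha + math.log(prior)


def bayes_rule(x, params):
    """Return the label k with the largest delta_k(x).

    params maps each label to a tuple (mu, sigma, prior).
    """
    return max(params, key=lambda k: delta_quadratic(x, *params[k]))
```
<br />
With one tight class centred at the origin and one broad class centred at <math>\,(3,0)</math>, a point far to the left of both centres can be assigned to the broad class, which is the kind of behaviour a linear boundary cannot produce.<br />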
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When the classes are assumed to share a common covariance matrix, it is estimated by the average of the class sample covariances, weighted by class size:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}(n_r\hat{\Sigma}_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
<br />
This is the maximum likelihood estimate of the common covariance matrix.<br />
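<br />
The estimates above can be sketched in a few lines of Python. This is an illustrative sketch rather than the course's code; the function name <code>estimate_parameters</code> is hypothetical, data points are assumed to be lists of numbers, and the covariances use the <math>\,1/n_k</math> divisor shown in the formulas.<br />
<br />
```python
def estimate_parameters(xs, ys):
    """Return (priors, means, covs, pooled) estimated from labeled data.

    xs is a list of d-dimensional points, ys the matching class labels.
    """
    n = len(xs)
    d = len(xs[0])
    counts, priors, means, covs = {}, {}, {}, {}
    for k in sorted(set(ys)):
        pts = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(pts)
        counts[k] = n_k
        priors[k] = n_k / n  # pi_k = n_k / n
        mu = [sum(p[j] for p in pts) / n_k for j in range(d)]
        means[k] = mu
        # Class covariance with the 1/n_k divisor
        covs[k] = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in pts) / n_k
                    for j in range(d)] for i in range(d)]
    # Common covariance: class covariances weighted by class size
    pooled = [[sum(counts[k] * covs[k][i][j] for k in counts) / n
               for j in range(d)] for i in range(d)]
    return priors, means, covs, pooled
```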
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data in each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, for each class we can compute half the squared distance between the point and the class center and adjust it by subtracting the log of the prior, <math>\,\log(\pi_k)</math>. The class with the smallest adjusted distance maximises <math>\,\delta_k</math>, and according to the theorem we classify the point to that class <math>\,k</math>. When the priors are equal, this reduces to assigning the point to the nearest class center. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
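<br />
Case 1 amounts to a nearest-center rule with a prior adjustment. A minimal Python sketch under that reading follows; the function name <code>classify_identity_cov</code> is hypothetical and the class means and priors are assumed to be given as dictionaries keyed by label.<br />
<br />
```python
import math


def classify_identity_cov(x, means, priors):
    """Return the label maximising delta_k(x) when Sigma_k = I for every class."""

    def delta(k):
        # Squared Euclidean distance between x and the class center
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, means[k]))
        # delta_k(x) = -1/2 ||x - mu_k||^2 + log(pi_k); the log|I| term is zero
        return -0.5 * sq_dist + math.log(priors[k])

    return max(means, key=delta)
```
<br />
With equal priors the rule picks the nearest center; a strongly unequal prior can move a point that is slightly closer to one center into the other class.<br />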
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semidefinite, as a covariance matrix is, we can take <math>\, U=V</math>. Here <math>\, S</math> is the diagonal matrix of eigenvalues of <math>\, \Sigma_k </math>.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthogonal, so <math>\,U^{-1}=U^\top</math>)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where the last line is the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
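<br />
The transformation can be sketched in Python as follows. This is an illustration only; the function names <code>whiten</code> and <code>squared_distance</code> are hypothetical, and the eigenvectors and eigenvalues of the covariance matrix are assumed to have been computed already.<br />
<br />
```python
import math


def whiten(x, U, s):
    """Return x* = S^{-1/2} U^T x.

    U is a d x d matrix (list of rows) whose columns are the eigenvectors
    of the covariance matrix; s lists the matching eigenvalues (all > 0).
    """
    d = len(x)
    # (U^T x)_j = sum_i U[i][j] * x[i], then scale by 1 / sqrt(s_j)
    return [sum(U[i][j] * x[i] for i in range(d)) / math.sqrt(s[j])
            for j in range(d)]


def squared_distance(p, q):
    # Squared Euclidean distance between two points
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
```
<br />
After whitening both the data point and the class mean, <code>squared_distance</code> between them equals the quadratic form <math>\,(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)</math>, so the Case 1 rule can be applied to the transformed points.<br />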
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose two classes have different shapes and we try to transform each of them to the same spherical shape. To classify a new data point we must first transform it, but which transformation should we use? If we use the transformation of class A, then we have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make the assumption made by LDA that the covariance matrix for each class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold for real data. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
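<br />
The two counts are easy to tabulate. The following Python sketch simply evaluates the formulas above; the function names <code>lda_param_count</code> and <code>qda_param_count</code> are hypothetical.<br />
<br />
```python
def lda_param_count(K, d):
    """(K - 1) linear boundaries, each a^T x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)


def qda_param_count(K, d):
    """(K - 1) quadratic boundaries, each with d(d + 3)/2 + 1 parameters."""
    # d * (d + 3) is always even, so integer division is exact
    return (K - 1) * (d * (d + 3) // 2 + 1)
```
<br />
For example, with <math>\,K=3</math> classes and <math>\,d=64</math> dimensions these formulas give 130 parameters for LDA and 4290 for QDA.<br />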
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of our original features. We then do LDA on our new higher-dimensional data. The linear boundary that LDA finds in the augmented space corresponds to a quadratic boundary when it is expressed in terms of the original features.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> distinct entries, the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
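where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is a <math>\,d \times d</math> matrix, that a linear method cannot estimate directly. Here we take <math>\,v</math> to be diagonal with entries <math>\,v_1,\dots,v_d</math>, so that <math>\,x^Tvx = \sum_{i=1}^{d} v_i x_i^2</math>.<br />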
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix containing the squared entries of the original matrix, with cubic features by appending the cubed entries, or even with a different function altogether, such as <math>\,\sin(x)</math>.<br />
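<br />
The augmentation itself is a one-line operation. The Python sketch below is illustrative only; the names <code>augment_with_squares</code> and <code>linear_score</code> are hypothetical. It shows that a linear function of the augmented vector <math>\,x^*</math> equals a quadratic function of the original <math>\,x</math>.<br />
<br />
```python
def augment_with_squares(x):
    """Return x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]


def linear_score(w_star, x_star):
    """Return w*^T x*, a linear function in the augmented space."""
    return sum(wi * xi for wi, xi in zip(w_star, x_star))


# Example values chosen for illustration
w = [1.0, 2.0]    # linear weights w_1, w_2
v = [0.5, -1.0]   # quadratic weights v_1, v_2
x = [2.0, -3.0]
score = linear_score(w + v, augment_with_squares(x))
# Same value as w^T x + sum_i v_i x_i^2 computed on the original x
direct = sum(wi * xi for wi, xi in zip(w, x)) + sum(vi * xi ** 2 for vi, xi in zip(v, x))
```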
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
{{Cleanup|date=September 2010|reason=Perhaps the first line should be plot (sample(1:200,1), sample(1:200,2), 'b.');?}} <br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] that performs PCA conveniently; the details of this function can be found in the Matlab help file on <code>princomp</code>. Here we examine the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with comments on the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in SCORES,<br />
 % the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % m = number of rows (observations), n = number of columns (variables) <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is useful in dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data sets. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points in each class close to each other while simultaneously keeping them far from the other classes.<br />
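<br />
One way to make this description concrete is to score a candidate projection direction by the squared distance between the projected class means divided by the total spread of the projected points within each class. The Python sketch below uses that ratio as an assumed formalization of the rough description above; the function name <code>fisher_ratio</code> and the toy data are hypothetical.<br />
<br />
```python
def fisher_ratio(w, class_a, class_b):
    """Score direction w: (distance between projected means)^2 / within-class scatter.

    Assumes the projected points are not all identical within both classes,
    so that the denominator is nonzero.
    """

    def project(points):
        # Scalar projection of each point onto w
        return [sum(wi * xi for wi, xi in zip(w, p)) for p in points]

    def mean(values):
        return sum(values) / len(values)

    def scatter(values):
        m = mean(values)
        return sum((t - m) ** 2 for t in values)

    pa, pb = project(class_a), project(class_b)
    return (mean(pa) - mean(pb)) ** 2 / (scatter(pa) + scatter(pb))


# Toy data: the classes differ along the first coordinate, while the
# second coordinate has the larger overall variance (9 versus 4.25
# with a 1/n divisor) but carries no class information.
class_a = [[0.0, 0.0], [1.0, 0.0], [0.0, 6.0], [1.0, 6.0]]
class_b = [[4.0, 0.0], [5.0, 0.0], [4.0, 6.0], [5.0, 6.0]]
```
<br />
On this toy data the first coordinate direction scores much higher than the second, even though the second has more variance, which mirrors the contrast with PCA drawn above.<br />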
<br />
===Example in R===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, and classification, and it is mostly used for exploratory data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data(Variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
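<br />
To illustrate "direction of greatest variance", the Python sketch below finds the first principal component of a small data set. Note the assumptions: it uses power iteration on the sample covariance matrix rather than the SVD used elsewhere in these notes, the function name <code>first_principal_component</code> is hypothetical, and it presumes that the largest eigenvalue is strictly larger than the others and that the starting vector is not orthogonal to the answer.<br />
<br />
```python
import math


def first_principal_component(data, iters=200):
    """Return a unit vector along the direction of greatest sample variance."""
    n = len(data)
    d = len(data[0])
    # Center the data by subtracting the mean of each variable
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix with the 1/(n-1) divisor
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply cov and renormalise
    w = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w
```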
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
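The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>\displaystyle c > 0</math>, so without loss of generality we take <math>\textbf{w}</math> to be a unit vector,<br><br />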
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> obtained with the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel or, equivalently, the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
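<br />
The example can be checked numerically. The following is a minimal sketch in Python (standard library only); the function names are hypothetical, and the partial derivatives are obtained by differentiating <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> directly. It confirms that both candidate points are stationary points of the Lagrangian and picks the one with the larger value of <math>\displaystyle f</math>.<br />

```python
import math

def f(x, y):
    """Objective function f(x, y) = x - y."""
    return x - y

def lagrangian_gradient(x, y, lam):
    """Partial derivatives of L(x, y, lam) = x - y - lam * (x**2 + y**2 - 1)."""
    return (1 - 2 * lam * x, -1 - 2 * lam * y, -(x ** 2 + y ** 2 - 1))

r = math.sqrt(2) / 2
# From dL/dx = 1 - 2*lam*x = 0 we get lam = 1 / (2*x) at each stationary point.
candidates = [(r, -r, 1 / (2 * r)), (-r, r, 1 / (2 * -r))]

# Every partial derivative should vanish (up to rounding) at both candidates.
for x, y, lam in candidates:
    assert all(abs(g) < 1e-12 for g in lagrangian_gradient(x, y, lam))

# The constrained maximum is the candidate with the larger objective value.
best = max(candidates, key=lambda p: f(p[0], p[1]))
print(best[:2], f(best[0], best[1]))
```

The other candidate, <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>, is the constrained minimum.<br />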
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, then the second term is nonzero and acts as a penalty for violating the constraint. At the constrained maximum we have <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
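<br />
These facts can be illustrated on a small two-dimensional data set. The following is a minimal sketch in Python (standard library only); the data points and all function names are made up for illustration, and the eigenvalues of the symmetric 2 by 2 sample covariance matrix are computed from the closed-form solution of its characteristic equation rather than with a general eigen-solver. It checks that the variance along each eigenvector equals the corresponding eigenvalue and that the eigenvalues sum to <math>\displaystyle Tr(S)</math>.<br />

```python
import math

def sample_covariance(points):
    """2x2 sample covariance (divisor n - 1) of a list of (x1, x2) pairs."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    return [[s11, s12], [s12, s22]]

def eigen_symmetric_2x2(S):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, c = S[0][0], S[0][1], S[1][1]
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    values = [mean + radius, mean - radius]
    if b == 0:  # already diagonal: the coordinate axes are the eigenvectors
        vectors = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
        return values, vectors
    vectors = []
    for lam in values:
        w = (b, lam - a)  # solves (S - lam * I) w = 0 when b != 0
        norm = math.hypot(w[0], w[1])
        vectors.append((w[0] / norm, w[1] / norm))
    return values, vectors

def variance_along(S, w):
    """The quadratic form w^T S w, i.e. the variance of the projection onto w."""
    return (w[0] * (S[0][0] * w[0] + S[0][1] * w[1])
            + w[1] * (S[1][0] * w[0] + S[1][1] * w[1]))

# Hypothetical data, for illustration only.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7)]
S = sample_covariance(points)
values, vectors = eigen_symmetric_2x2(S)
print("eigenvalues:", values)
print("first principal component:", vectors[0])
print("trace of S:", S[0][0] + S[1][1])
```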
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                  % load the data matrix X (one image per column)<br />
 who                                         % list the variables in the workspace<br />
 size(X)                                     % number of pixels by number of images<br />
 imagesc(reshape(X(:,1),20,28)')             % display the first noisy image<br />
 colormap gray<br />
 [u, s, v] = svd(X);                         % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';   % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)')       % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
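<br />
The same idea can be sketched on a toy data set. The following minimal Python example (standard library only) is not the Matlab code from class: it assumes made-up two-dimensional points lying close to a line, subtracts the mean, finds the first principal component by power iteration on the sample covariance matrix (a swapped-in technique, used here in place of the SVD), and reconstructs every point from a single coefficient. All function names are hypothetical.<br />

```python
import math

def top_principal_direction(S, iterations=200):
    """Power iteration: unit eigenvector of S with the largest eigenvalue."""
    d = len(S)
    w = [1.0] * d
    for _ in range(iterations):
        v = [sum(S[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    return w

def reconstruct_rank1(points):
    """Encode each point by one coefficient, then decode it again."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    S = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
         for i in range(d)]
    w = top_principal_direction(S)
    coeffs = [sum(c[j] * w[j] for j in range(d)) for c in centered]  # y = w^T (x - mean)
    return [[mean[j] + y * w[j] for j in range(d)] for y in coeffs]  # x_hat = mean + y w

# Hypothetical noisy points near the line x2 = 2 * x1.
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.1), (3.0, 5.9), (4.0, 8.1)]
recon = reconstruct_rank1(points)

# Total squared reconstruction error: the variation discarded with the second component.
residual = sum((p[j] - r[j]) ** 2 for p, r in zip(points, recon) for j in range(2))
print(recon)
print("squared reconstruction error:", residual)
```

The reconstructed points all lie on one line through the mean, and the small perpendicular scatter is discarded.<br />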
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. do not label the data, and compare the resulting groups with the classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using the mapping <math>\, Y_{d \times n} = U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using the mapping <math>\, \hat{X}_{D \times n} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6422stat841f102010-10-02T21:08:34Z<p>Hclam: /* Fisher's Discriminant Analysis */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples, for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Regression_analysis clustering], and [http://en.wikipedia.org/wiki/Regression_analysis dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>-th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>-th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
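<br />
The empirical error rate is simply the fraction of training inputs that the rule misclassifies. A minimal sketch in Python follows; the threshold rule <code>h</code> and the five training pairs are made up for illustration.<br />

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of training pairs (x_i, y_i) for which h(x_i) != y_i."""
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

def h(x):
    """Hypothetical classification rule: label 1 when the feature exceeds 0.5."""
    return 1 if x > 0.5 else 0

# Hypothetical training set with a single feature per input.
xs = [0.1, 0.4, 0.6, 0.9, 0.3]
ys = [0, 1, 1, 1, 0]

rate = empirical_error_rate(h, xs, ys)
print(rate)  # one of the five inputs (x = 0.4) is misclassified
```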
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
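<br />
The two-class computation can be sketched in a few lines of Python. The function names and the numerical values of the likelihoods and priors below are hypothetical; in practice these quantities have to be estimated from training data.<br />

```python
def posterior_class1(lik1, lik0, prior1):
    """r(x) = P(Y=1 | X=x), given P(X=x | Y=1), P(X=x | Y=0) and P(Y=1)."""
    prior0 = 1.0 - prior1
    evidence = lik1 * prior1 + lik0 * prior0  # P(X=x)
    return lik1 * prior1 / evidence

def bayes_rule(r):
    """Classification rule: label 1 if r(x) > 1/2, otherwise label 0."""
    return 1 if r > 0.5 else 0

# Hypothetical values: P(X=x | Y=1) = 0.6, P(X=x | Y=0) = 0.2, P(Y=1) = 0.3.
r = posterior_class1(0.6, 0.2, 0.3)
label = bayes_rule(r)
print(r, label)
```

Here the larger likelihood under class 1 outweighs its smaller prior, so the input is assigned to class 1.<br />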
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. A rather detailed proof of this theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be re-worded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes, represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his or her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.2*0.5+0.05*0.5}=\frac{0.025}{0.125}=\frac{1}{5}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
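<br />
As a quick check of the arithmetic, the posterior can be recomputed in Matlab. This is only a sketch: the variable names are our own, and the likelihood values 0.05 and 0.2 are the ones that appear in the calculation above.<br />
<br />
 % Likelihoods of x = (0,1,0) under each class, as used in the calculation above<br />
 lik_pass = 0.05;     % P(X = (0,1,0) | Y = 1)<br />
 lik_fail = 0.2;      % P(X = (0,1,0) | Y = 0)<br />
 prior_pass = 0.5;    % estimated P(Y = 1)<br />
 prior_fail = 0.5;    % estimated P(Y = 0)<br />
 % Bayes formula for the posterior P(Y = 1 | X = x)<br />
 r_hat = lik_pass*prior_pass / (lik_pass*prior_pass + lik_fail*prior_fail);<br />
 % Classification rule: predict 1 (pass) only if r_hat exceeds 1/2<br />
 h = double(r_hat > 0.5);<br />
<br />
Here <code>r_hat</code> evaluates to 0.2, so <code>h</code> is 0 (fail).<br />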
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that, for any event E, event E has a '''prior probability''' that represents how believable it is that event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work that laid the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
<br />
Not surprisingly, the Bayes classifier's decision boundary being linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
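<br />
Once the parameters are known, the coefficients of the boundary can be computed directly. The following is a minimal sketch with made-up parameter values (<code>mu0</code>, <code>mu1</code>, <code>Sigma</code>, <code>pi0</code> and <code>pi1</code> are illustrative assumptions, not estimates from any data set in these notes).<br />
<br />
 mu0 = [0; 0];  mu1 = [2; 0];      % assumed class means (column vectors)<br />
 Sigma = eye(2);                   % assumed common covariance matrix<br />
 pi0 = 0.5;  pi1 = 0.5;            % assumed prior probabilities<br />
 % Coefficients a (vector) and b (scalar) of the linear boundary derived above<br />
 a = -2 * (Sigma \ (mu1 - mu0));<br />
 b = -2*log(pi1/pi0) + mu1'*(Sigma\mu1) - mu0'*(Sigma\mu0);<br />
 % The left-hand side is negative on the side of class 1 and positive on the side of class 0<br />
 x = [1.5; 3];<br />
 label = double(a'*x + b < 0);<br />
<br />
With these values the boundary is the vertical line <math>\,x_1 = 1</math>, and the point <math>\,(1.5, 3)</math> is assigned to class 1. The sign convention follows from the derivation: class 1 is preferred when <math>\,f_1(x)\pi_1 > f_0(x)\pi_0</math>, and multiplying by -2 reverses the inequality.<br />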
<br />
<br />
LDA under the two-class case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, except with the classes <math>\,1 </math> and <math>\,0</math> being replaced by the classes <math>\,m </math> and <math>\,n</math>. One then obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. For any two classes <math>\,m </math> and <math>\,n</math> for which we would like to find the Bayes classifier's decision boundary using LDA, if <math>\,m </math> and <math>\,n</math> both have the same number of data points, then their estimated priors are equal, the log term vanishes, and the resulting decision boundary passes through the point exactly halfway between the means of <math>\,m </math> and <math>\,n</math>.<br />
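<br />
To see why the boundary passes through the midpoint of the two class means when the classes are of equal size, note that equal estimated priors give <math>\,\log(\frac{\pi_m}{\pi_n})=0</math>. Substituting <math>\,x = \frac{\mu_m+\mu_n}{2}</math> into the remaining terms of the boundary equation gives zero, because the cross terms cancel by the symmetry of <math>\,\Sigma^{-1}</math>:<br />
<br />
:<math>\begin{align}<br />
&\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - (\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n) \\<br />
&= \mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - \left( \mu_m^\top\Sigma^{-1}\mu_m - \mu_m^\top\Sigma^{-1}\mu_n + \mu_n^\top\Sigma^{-1}\mu_m - \mu_n^\top\Sigma^{-1}\mu_n \right) \\<br />
&= 0<br />
\end{align}</math><br />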
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found at [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \; \forall k </math> the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic discriminant analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariances of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
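<br />
The linear form follows from the quadratic one. Expanding the quadratic term with a common covariance <math>\,\Sigma_k = \Sigma</math> gives<br />
<br />
:<math>\begin{align}<br />
\delta_k(x) &= - \frac{1}{2}\log(|\Sigma|) - \frac{1}{2}x^\top\Sigma^{-1}x + x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k)<br />
\end{align}</math><br />
<br />
The first two terms do not depend on <math>\,k</math>, so they do not affect the arg max and can be dropped, which leaves the linear expression above.<br />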
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=October 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When the classes are assumed to share a common covariance matrix, it is estimated by the weighted average of the class sample covariances:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}(n_r\hat{\Sigma_r})}{\sum_{l=1}^{k}(n_l)} </math><br />
<br />
This is the maximum likelihood (ML) estimate of the common covariance matrix.<br />
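<br />
The estimates above can be sketched in a few lines of Matlab. The toy data and all variable names below are our own assumptions, chosen only so that the numbers are easy to verify by hand.<br />
<br />
 X = [1 2; 3 4; 5 6; 7 10];   % toy data, one observation per row (assumed values)<br />
 y = [1; 1; 2; 2];            % class labels of the rows of X<br />
 n = size(X, 1);  d = size(X, 2);  K = max(y);<br />
 pi_hat = zeros(K, 1);  mu_hat = zeros(K, d);<br />
 Sigma_hat = zeros(d, d, K);  Sigma_pooled = zeros(d, d);<br />
 for k = 1:K<br />
     Xk = X(y == k, :);                       % observations in class k<br />
     nk = size(Xk, 1);<br />
     pi_hat(k) = nk / n;                      % prior estimate n_k / n<br />
     mu_hat(k, :) = mean(Xk, 1);              % class mean<br />
     Xc = Xk - repmat(mu_hat(k, :), nk, 1);   % centred observations<br />
     Sigma_hat(:, :, k) = (Xc' * Xc) / nk;    % ML covariance (divides by n_k)<br />
     Sigma_pooled = Sigma_pooled + nk * Sigma_hat(:, :, k);<br />
 end<br />
 Sigma_pooled = Sigma_pooled / n;             % weighted average of the class covariances<br />
<br />
Note that these are the ML estimates, which divide by <math>\,n_k</math>; Matlab's built-in <code>cov</code> divides by <math>\,n_k - 1</math> by default and would therefore give slightly different values.<br />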
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data of each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles (spheres in higher dimensions).<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class for which half the squared distance minus the log prior is smallest maximises <math>\,\delta_k</math>; when the priors are equal, this is simply the class with the nearest center. According to the theorem, we then assign the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
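<br />
Case 1 can be sketched as follows. The class means, priors and test point are made-up values, and the variable names are our own.<br />
<br />
 mu = [0 0; 4 0];             % assumed class means, one per row<br />
 prior = [0.5; 0.5];          % assumed prior probabilities<br />
 x = [1 1];                   % new point to classify<br />
 K = size(mu, 1);<br />
 delta = zeros(K, 1);<br />
 for k = 1:K<br />
     diff_k = x - mu(k, :);<br />
     % delta_k with identity covariance: minus half the squared distance plus the log prior<br />
     delta(k) = -0.5 * (diff_k * diff_k') + log(prior(k));<br />
 end<br />
 [best, label] = max(delta);  % label is the index of the largest delta_k<br />
<br />
The squared distances from the point to the two centers are 2 and 10, and the priors are equal, so the point is assigned to class 1.<br />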
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is a covariance matrix, which is both.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
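<br />
The effect of the transformation can be checked numerically. In the sketch below, the covariance matrix, mean and data point are made-up values and the variable names are our own; <code>svd</code> is the standard Matlab singular value decomposition.<br />
<br />
 Sigma = [4 0; 0 1];          % assumed covariance matrix (symmetric, positive definite)<br />
 mu = [1; 2];                 % assumed class mean<br />
 x = [3; 3];                  % a data point<br />
 [U, S, V] = svd(Sigma);      % for this Sigma, V equals U<br />
 T = diag(1 ./ sqrt(diag(S))) * U';                  % the transformation described above<br />
 x_star = T * x;<br />
 mu_star = T * mu;<br />
 d_quad = (x - mu)' * (Sigma \ (x - mu));            % quadratic form in the original space<br />
 d_eucl = (x_star - mu_star)' * (x_star - mu_star);  % squared Euclidean distance after transforming<br />
<br />
Both <code>d_quad</code> and <code>d_eucl</code> equal 2 here (up to rounding), which illustrates that the quadratic term of <math>\,\delta_k</math> becomes a plain squared Euclidean distance in the transformed space.<br />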
<br />
Note that when we have multiple classes, they must all have the same transformation; otherwise, we would have to assume ahead of time that a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Consider two classes with different shapes, and consider transforming them to the same shape. Given a data point, we must decide which class it belongs to. The question is, which transformation should we use? For example, if we use the transformation of class A, then we have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need the differences between one given class and each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
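<br />
As a concrete illustration, the two counts can be evaluated for two classes in 64 dimensions, which is the dimensionality of the 2_3 data used in these notes before PCA. The variable names are our own.<br />
<br />
 K = 2;  d = 64;                          % two classes in 64 dimensions<br />
 n_lda = (K - 1) * (d + 1);               % number of parameters for LDA<br />
 n_qda = (K - 1) * (d*(d + 3)/2 + 1);     % number of parameters for QDA<br />
<br />
This gives 65 parameters for LDA and 2145 for QDA.<br />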
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that contain nonlinear functions of our original features, such as their squares. We then do LDA on our new higher-dimensional data. The linear boundary provided by LDA in the higher-dimensional space can then be expressed in terms of the original features, giving us a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
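where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. Here <math>v</math> is a <math>d \times d</math> matrix; in the construction below it is taken to be diagonal with entries <math>v_1,\dots,v_d</math>.<br />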
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix containing the element-wise squares of the original matrix, with cubic features by appending the element-wise cubes, or even with a different function altogether, such as <math>\,\sin(x)</math>.<br />
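<br />
For instance, with <math>\,d = 2</math> the extended vectors are <math>\underline{w}^{*T} = [w_1,w_2,v_1,v_2]</math> and <math>x^{*T} = [x_1,x_2,{x_1}^2,{x_2}^2]</math>, so that<br />
<br />
:<math>\underline{w}^{*T}x^* = w_1x_1 + w_2x_2 + v_1{x_1}^2 + v_2{x_2}^2</math><br />
<br />
This expression is linear in the four entries of <math>\,x^*</math>, which is what LDA needs, but quadratic in the original <math>\,x_1</math> and <math>\,x_2</math>. Note that it contains no cross term <math>\,x_1x_2</math>, because only the squares were appended.<br />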
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This extends our sample by two more dimensions, which contain the squares of the two original dimensions.<br />
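<br />
:The loop above can also be written without explicit indexing. The following sketch uses a small made-up matrix (<code>sample_demo</code> and <code>X_star_demo</code> are hypothetical stand-ins for <code>sample</code> and <code>X_star</code>) so that it runs without the 2_3 data.<br />
<br />
 sample_demo = [1 2; 3 4; -1 0.5];             % made-up two-dimensional data, one point per row<br />
 X_star_demo = [sample_demo, sample_demo.^2];  % append the element-wise squares as columns 3 and 4<br />
<br />
:Each row of <code>X_star_demo</code> then holds the two original coordinates followed by their squares, which is the same layout that the loop produces.<br />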
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
{{Cleanup|date=September 2010|reason=Perhaps the first line should be plot (sample(1:200,1), sample(1:200,2), 'b.');?}} <br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This counts the entries where <code>class</code> equals <code>group</code>. The answer of 369 tells us that the algorithm correctly classified 369 of the 400 data points, so 31 points were misclassified. This gives us an ''empirical error rate'' of 31/400 = 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA (371 versus 369); we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method used in the assignment. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should note the following points when comparing <code>princomp</code> with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>, because the observations are stored in the rows of <math>\,X</math>.<br />
<br />
The following example performs PCA using the SVD method and <code>princomp</code> respectively, and obtains the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U, up to the sign of each column (singular vectors are only determined up to sign).<br />
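The equivalence between the SVD route and the covariance eigenvector route can also be checked without the Statistics Toolbox. The following is a minimal sketch, not the course's own code: the data are synthetic (an assumption made purely for illustration), and only the base functions <code>svd</code>, <code>eig</code> and <code>cov</code> are used.

```matlab
% Synthetic data (an assumption for illustration): 200 observations in rows,
% 3 correlated variables in columns.
n = 200;
X = randn(n, 3) * [3 0 0; 1 2 0; 0.5 0.5 1];

% SVD route, mirroring the centering and scaling used inside princomp.
Xc = X - repmat(mean(X, 1), n, 1);       % subtract column means
[U, D, V] = svd(Xc / sqrt(n - 1), 0);    % economy-size SVD
latentSvd = diag(D) .^ 2;                % variances along the principal components
scoreSvd  = Xc * V;                      % data expressed in PC coordinates

% Eigen route: eigenvectors of the sample covariance matrix.
[E, L] = eig(cov(X));
[latentEig, order] = sort(diag(L), 'descend');
E = E(:, order);

% Eigenvectors are only defined up to sign, so flip columns of E to match V.
signs = sign(sum(E .* V, 1));
E = E .* repmat(signs, 3, 1);

fprintf('max eigenvalue difference:  %g\n', max(abs(latentSvd - latentEig)));
fprintf('max eigenvector difference: %g\n', max(abs(E(:) - V(:))));
```

Both differences should be at the level of floating point round-off. The explicit sign alignment is the reason the two methods can appear to disagree when their outputs are compared column by column.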
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. It collapses data to lower dimensions and maximizes separation between classes. FDA is commonly used for dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points in individual classes close to each other while simultaneously keeping them far from the other classes.<br />
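The idea above can be sketched numerically for two classes. The sketch below is illustrative only and rests on assumptions not stated in the text: the two data clouds are synthetic, and the projection direction is taken from the standard two-class Fisher solution, proportional to <code>Sw \ (m1 - m2)</code>, where <code>Sw</code> is the within-class scatter matrix and <code>m1</code>, <code>m2</code> are the class means. The criterion <code>J</code> (squared distance between projected means divided by projected within-class scatter) is used to compare the FDA direction with the first PCA direction.

```matlab
% Two synthetic classes (an assumption for illustration), both elongated
% along the direction (1, 1) and offset from each other along (1, -1).
n  = 100;
R  = chol([4 3.5; 3.5 4]);                     % R' * R is the class covariance
X1 = randn(n, 2) * R;                          % class 1, centred near (0, 0)
X2 = randn(n, 2) * R + repmat([2 -2], n, 1);   % class 2, centred near (2, -2)

m1 = mean(X1, 1)';
m2 = mean(X2, 1)';
Sw = (n - 1) * (cov(X1) + cov(X2));            % within-class scatter matrix

% Standard two-class Fisher direction, normalised to unit length.
wFda = Sw \ (m1 - m2);
wFda = wFda / norm(wFda);

% First principal component of the pooled data, ignoring the labels.
[V, L] = eig(cov([X1; X2]));
[~, idx] = max(diag(L));
wPca = V(:, idx);

% Separation criterion: between-class distance over within-class spread.
J = @(w) (w' * (m1 - m2))^2 / (w' * Sw * w);
fprintf('J along FDA direction: %g\n', J(wFda));
fprintf('J along PCA direction: %g\n', J(wPca));
```

With clouds shaped like these, the PCA direction follows the long axis of the data and separates the classes poorly, while the FDA direction is chosen to pull the projected class means apart relative to the projected spread.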
<br />
===Example in R===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for extracting features from data, and is often used as a preprocessing step for classification. It has applications in data visualization, data mining, and reducing the dimensionality of a data set. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle \textbf{w} </math>. The first principal component is the direction that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq | \textbf{w}^T s \textbf{w} | \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the constrained maximum of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system (with <math>\displaystyle \lambda = \pm\sqrt{2}/2</math>) we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one gives the larger value: <math>\displaystyle f(\sqrt{2}/2,-\sqrt{2}/2)=\sqrt{2}</math> and <math>\displaystyle f(-\sqrt{2}/2,\sqrt{2}/2)=-\sqrt{2}</math>. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
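The result of this example can be checked numerically. The following is a small sketch, assuming nothing beyond the example itself: it parameterizes the unit circle as <code>(cos(theta), sin(theta))</code>, so that every sampled point satisfies the constraint, and compares the best sampled value of <code>f</code> with the stationary point obtained from the Lagrangian <math>L = f - \lambda g</math>.

```matlab
% Sample the constraint x^2 + y^2 = 1 densely and evaluate f(x, y) = x - y.
theta = linspace(0, 2*pi, 100001);
x = cos(theta);
y = sin(theta);
f = x - y;
[fmax, i] = max(f);
fprintf('numerical maximum %.6f at (%.4f, %.4f)\n', fmax, x(i), y(i));

% Stationary point from the Lagrangian with lambda = sqrt(2)/2.
lambda = sqrt(2) / 2;
xs =  1 / (2 * lambda);
ys = -1 / (2 * lambda);

% All three partial derivatives of L should vanish at (xs, ys, lambda).
dLdx      =  1 - 2 * lambda * xs;
dLdy      = -1 - 2 * lambda * ys;
dLdlambda = -(xs^2 + ys^2 - 1);
fprintf('analytic  maximum %.6f at (%.4f, %.4f)\n', xs - ys, xs, ys);
```

The grid search and the Lagrange solution should agree on the point <math>(\sqrt{2}/2, -\sqrt{2}/2)</math> with value <math>\sqrt{2}</math>.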
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, then the second term is nonzero and the constraint is violated. Setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = -(\textbf{w}^T \textbf{w} - 1) = 0</math> recovers the constraint, so every stationary point of <math>\displaystyle L(\textbf{w},\lambda)</math> satisfies <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. The first PC, '''u<sub>1</sub>''', therefore has the maximum variance<br /> (i.e. it captures as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
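The decomposition of the total variance can be verified numerically. The sketch below uses synthetic data (an assumption for illustration) and base functions only; it checks that the variance of each projection equals the corresponding eigenvalue of <math>S</math>, and that the eigenvalues sum to <math>Tr(S)</math>, which is also the sum of the variances of the original variables.

```matlab
% Synthetic data (an assumption for illustration): 500 observations in rows,
% 4 correlated variables in columns.
n = 500;
X = randn(n, 4) * [2 0 0 0; 1 1.5 0 0; 0 0.5 1 0; 0.3 0 0.2 0.5];

S = cov(X);                                   % sample covariance matrix
[W, L] = eig(S);
[lambda, order] = sort(diag(L), 'descend');   % lambda_1 >= ... >= lambda_D
W = W(:, order);

% Project the centered data onto each eigenvector: u_i = w_i' * x.
Xc = X - repmat(mean(X, 1), n, 1);
U = Xc * W;
varU = var(U)';                               % Var(u_i), one entry per component

fprintf('Var(u_i):       %s\n', mat2str(varU', 4));
fprintf('lambda_i:       %s\n', mat2str(lambda', 4));
fprintf('sum lambda_i:   %g\n', sum(lambda));
fprintf('Tr(S):          %g\n', trace(S));
fprintf('sum Var(x_i):   %g\n', sum(var(X)));
```

The first two printed rows should match entry by entry, and the last three numbers should be equal up to round-off.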
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images. The first one represents the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the denoised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
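The trade-off described above can be illustrated on data where the clean signal is known. This is a sketch under stated assumptions, not a reproduction of the class example: the "clean" data are a synthetic matrix of rank 3 (so the intrinsic dimensionality is 3 by construction), Gaussian noise is added, and the reconstruction from the top ''k'' components is compared against the clean signal for several values of ''k''. The names <code>clean</code>, <code>noisy</code> and <code>ks</code> are hypothetical.

```matlab
% Synthetic rank-3 signal plus noise (an assumption for illustration).
% Each column is one sample, as in the noisy face data.
D = 50;  n = 300;  r = 3;
clean = randn(D, r) * randn(r, n);        % intrinsic dimensionality is r = 3
noisy = clean + 0.5 * randn(D, n);

[u, s, v] = svd(noisy, 0);                % economy-size SVD

ks  = [1 3 10 D];                         % numbers of components to try
err = zeros(size(ks));
for j = 1:numel(ks)
    k = ks(j);
    xHat = u(:, 1:k) * s(1:k, 1:k) * v(:, 1:k)';   % rank-k reconstruction
    err(j) = norm(xHat - clean, 'fro') / norm(clean, 'fro');
    fprintf('k = %2d: relative error against clean signal = %.3f\n', k, err(j));
end
```

Too few components discard part of the signal, while using all of them simply returns the noisy data. In real problems the clean signal is not available, which is one reason choosing the number of components is difficult.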
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. leave the data unlabeled and let the groups emerge from the data).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 by 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6418stat841f102010-10-02T21:04:40Z<p>Hclam: /* Description of FDA */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>-th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector, and the label of the <math>\,i</math>-th training input is <math>\,Y_{i} \in \mathcal{Y} </math>, where <math>\mathcal{Y}</math> is a finite set. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=October 2nd 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
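<br />
The empirical error rate is simple to compute once a rule <math>\,h</math> and a labeled data set are available. The following is a minimal sketch; the threshold rule and the five data points are hypothetical and serve only to illustrate the formula for <math>\,\hat{L}_{n}</math>.<br />

```python
def empirical_error_rate(h, data):
    """Fraction of (x, y) pairs in `data` that the rule h misclassifies."""
    if not data:
        raise ValueError("data must contain at least one labeled example")
    # The indicator I(h(X_i) != Y_i) is 1 for a mistake and 0 otherwise.
    mistakes = sum(1 for x, y in data if h(x) != y)
    return mistakes / len(data)


# Hypothetical rule: label a fruit 1 if its diameter exceeds 7, else 0.
def threshold_rule(x):
    return 1 if x > 7.0 else 0


# Hypothetical training set of (diameter, label) pairs.
training_data = [(6.0, 0), (6.5, 0), (7.5, 1), (8.0, 1), (7.2, 0)]
error = empirical_error_rate(threshold_rule, training_data)
```

Here only the last pair is misclassified, so one mistake out of five inputs is counted.<br />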
<br />
=== Bayes Classifier ===<br />
<br />
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that, within a class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P(Y=1|X=x)</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the components that make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the posterior probabilities to deviate quite a bit from their true population values as well. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math>\,r</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
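<br />
Approach 3 can be sketched in a few lines for the simplest setting, a single discrete feature whose class-conditional probabilities are estimated by relative frequencies. This is only an illustration under that assumption; the function names and the fallback to the prior for unseen feature values are choices made for this sketch, not part of the lecture.<br />

```python
from collections import Counter


def fit_plugin_classifier(xs, ys):
    """Return r_hat, an estimate of P(Y=1 | X=x) built from relative frequencies."""
    n = len(ys)
    n_class = {0: n - sum(ys), 1: sum(ys)}
    prior1 = n_class[1] / n                   # estimate of P(Y = 1) = mean of the Y_i
    counts = {0: Counter(), 1: Counter()}
    for x, y in zip(xs, ys):
        counts[y][x] += 1

    def likelihood(x, y):
        # Relative frequency of x among the training inputs with label y.
        return counts[y][x] / n_class[y] if n_class[y] else 0.0

    def r_hat(x):
        num = likelihood(x, 1) * prior1
        den = num + likelihood(x, 0) * (1 - prior1)
        # Assumption of this sketch: for a value never seen in training,
        # fall back to the estimated prior.
        return num / den if den > 0 else prior1

    return r_hat


def classify(r_hat, x):
    """Plug-in rule: predict 1 when the estimated posterior exceeds 1/2."""
    return 1 if r_hat(x) > 0.5 else 0


# Hypothetical data: colour of a fruit and a 0/1 label.
colours = ["red", "red", "green", "red", "green", "green"]
labels = [1, 1, 1, 0, 0, 0]
r_hat = fit_plugin_classifier(colours, labels)
```

With continuous features, the relative frequencies would be replaced by a density estimate such as a fitted Gaussian, which is what LDA and QDA do.<br />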
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
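<br />
The rule in the theorem can be sketched directly from the Bayes formula. The function names and the three-class numbers below are hypothetical; the sketch assumes that the likelihoods <math>\,f_i(x)</math> at the query point and the priors <math>\,\pi_i</math> are already available.<br />

```python
def posteriors(likelihoods, priors):
    """Bayes formula: P(Y=y | X=x) for every class y.

    likelihoods[y] is f_y(x) = P(X=x | Y=y); priors[y] is pi_y = P(Y=y).
    """
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())            # P(X = x)
    if evidence == 0:
        raise ValueError("x has zero probability under every class")
    return {y: joint[y] / evidence for y in joint}


def bayes_rule(likelihoods, priors):
    """h*(x): the class with the largest posterior probability."""
    post = posteriors(likelihoods, priors)
    return max(post, key=post.get)


# Hypothetical values for k = 3 classes at some fixed x.
f_x = {1: 0.1, 2: 0.4, 3: 0.2}
pi = {1: 0.5, 2: 0.2, 3: 0.3}
label = bayes_rule(f_x, pi)
```

Since the evidence <math>\,P(X=x)</math> is the same for every class, the arg max could equally be taken over the products <math>\,f_i(x)\pi_i</math> without normalizing.<br />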
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.2*0.5+0.05*0.5}=\frac{0.025}{0.125}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
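<br />
The arithmetic of this example can be checked with a short script. It assumes class-conditional probabilities of 0.05 for ''pass'' and 0.2 for ''fail'', which are the values that reproduce the fraction 0.025/0.125 quoted above; the tables themselves are only available as an image, so these inputs should be treated as an assumption.<br />

```python
def posterior_pass(lik_pass, lik_fail, prior_pass):
    """r_hat(x) = P(Y=1 | X=x) for two classes, via the Bayes formula."""
    num = lik_pass * prior_pass
    den = num + lik_fail * (1 - prior_pass)
    return num / den


# Assumed inputs: P(X=(0,1,0) | Y=1) = 0.05, P(X=(0,1,0) | Y=0) = 0.2,
# and equal priors of 0.5.
r_hat = posterior_pass(lik_pass=0.05, lik_fail=0.2, prior_pass=0.5)

# Classification rule: predict pass (1) only if r_hat exceeds 1/2.
prediction = 1 if r_hat > 0.5 else 0
```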
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how strongly one believes that E will occur before knowing anything about any other event whose occurrence could have an impact on E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information about events relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number of, well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
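<br />
The long-run approximation <math>P(x) \approx \frac{n_x}{n_t}</math> can be illustrated with a small simulation. This is a sketch only: the event probability of 0.3, the trial counts, and the fixed seed are arbitrary choices made for reproducibility.<br />

```python
import random


def relative_frequency(p, n_trials, seed=0):
    """Simulate n_trials independent trials of an event with probability p
    and return the observed relative frequency n_x / n_t."""
    rng = random.Random(seed)                 # fixed seed keeps the run reproducible
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p)
    return n_x / n_trials


# The estimate is expected to settle near p as the number of trials grows.
estimates = {n_t: relative_frequency(0.3, n_t) for n_t in (100, 10_000, 200_000)}
```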
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating boundary (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] when the boundary is linear) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways of deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
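<br />
The coefficients ''a'' and ''b'' can be computed numerically once the means, the shared covariance matrix and the priors are given. The sketch below restricts itself to two features so that the matrix inverse can be written out by hand; the helper names are hypothetical. Because the boundary expression equals <math>\,-2\log\frac{f_1(x)\pi_1}{f_0(x)\pi_0}</math>, a negative value favours class 1.<br />

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the decision boundary is {x : a.x + b = 0}."""
    s_inv = inv2(sigma)
    diff = [m1 - m0 for m1, m0 in zip(mu1, mu0)]
    a = [-2.0 * c for c in matvec(s_inv, diff)]
    b = (-2.0 * math.log(pi1 / pi0)
         + dot(mu1, matvec(s_inv, mu1))
         - dot(mu0, matvec(s_inv, mu0)))
    return a, b


def lda_classify(x, a, b):
    # a.x + b = -2 * log(f_1(x) pi_1 / (f_0(x) pi_0)); negative favours class 1.
    return 1 if dot(a, x) + b < 0 else 0


# Hypothetical parameters: identity covariance and equal priors.
a, b = lda_boundary([2.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
```

For these values the boundary is the vertical line through the midpoint of the two means.<br />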
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. Doing so, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. If the two classes <math>\,m </math> and <math>\,n</math> have the same number of data points, so that their estimated priors are equal and the log term vanishes, then the resulting decision boundary passes through the point exactly halfway between the two class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
<br />
<br />
An illustration of what the Bayes classifier's decision boundary for two classes looks like when derived using LDA can be found at [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier's rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=September 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
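<br />
The two discriminant functions can be sketched for the simplest case of a single feature, where <math>\,\Sigma_k</math> reduces to a scalar variance. This univariate restriction, the function names and the parameter values are assumptions of the sketch, chosen so that no matrix library is needed.<br />

```python
import math


def delta_qda(x, mu, var, prior):
    """Quadratic discriminant for a univariate Gaussian class."""
    return -0.5 * math.log(var) - 0.5 * (x - mu) ** 2 / var + math.log(prior)


def delta_lda(x, mu, var, prior):
    """Linear discriminant, valid when every class shares the variance `var`."""
    return x * mu / var - 0.5 * mu ** 2 / var + math.log(prior)


def classify(x, params, delta):
    """params maps a class label to (mu, var, prior); returns arg max_k delta_k(x)."""
    return max(params, key=lambda k: delta(x, *params[k]))


# Hypothetical classes with a shared variance: both discriminants agree.
shared = {1: (0.0, 1.0, 0.5), 2: (3.0, 1.0, 0.5)}
# Hypothetical classes with the same mean but different variances: QDA only.
unequal = {1: (0.0, 1.0, 0.5), 2: (0.0, 9.0, 0.5)}
```

In the shared-variance case the two discriminants differ only by terms that do not depend on <math>\,k</math>, which is why they select the same class.<br />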
<br />
===In practice===<br />
In practice, the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When a common covariance matrix is assumed, it is estimated by the average of the class sample covariances, weighted by class size: <br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{l=1}^{k}n_l} </math><br />
<br />
This is the maximum likelihood estimate of the common covariance matrix.<br />
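<br />
The estimates above can be computed in one pass over the classes. The following is a plain-Python sketch in which the data are lists of feature vectors; the function name and the toy data set are hypothetical, and the covariances divide by <math>\,n_k</math> as in the formulas above.<br />

```python
def lda_estimates(xs, ys):
    """Sample priors, class means, class covariances and the pooled covariance."""
    n, d = len(xs), len(xs[0])
    priors, means, covs = {}, {}, {}
    pooled = [[0.0] * d for _ in range(d)]
    for k in sorted(set(ys)):
        rows = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(rows)
        priors[k] = n_k / n
        mu = [sum(col) / n_k for col in zip(*rows)]
        # Average outer product of the centred rows (divides by n_k).
        cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / n_k
                for j in range(d)] for i in range(d)]
        means[k], covs[k] = mu, cov
        # Pooled covariance: class covariances weighted by n_k / n.
        for i in range(d):
            for j in range(d):
                pooled[i][j] += n_k * cov[i][j] / n
    return priors, means, covs, pooled


# Hypothetical data: four points in class 0 and two points in class 1.
points = [[0, 0], [2, 0], [0, 2], [2, 2], [5, 5], [7, 5]]
labels = [0, 0, 0, 0, 1, 1]
priors, means, covs, pooled = lda_estimates(points, labels)
```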
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data in each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can compute the squared distance between a point and each class center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. When the priors are equal, the class whose center is closest to the point maximises <math>\,\delta_k</math>; otherwise the log-prior term can shift the decision towards a more probable class. According to the theorem, we then assign the point to the class <math>\,k</math> with the largest <math>\,\delta_k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
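<br />
Case 1 amounts to a nearest-center rule with a log-prior adjustment, which can be sketched as follows. The function name, class labels and numbers are hypothetical.<br />

```python
import math


def classify_identity_cov(x, means, priors):
    """arg max_k of -0.5 * ||x - mu_k||^2 + log(pi_k), valid when Sigma_k = I."""
    def delta(k):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, means[k]))
        return -0.5 * sq_dist + math.log(priors[k])
    return max(means, key=delta)


# Hypothetical class centers.
centers = {"a": [0.0, 0.0], "b": [4.0, 0.0]}
equal = {"a": 0.5, "b": 0.5}
skewed = {"a": 0.01, "b": 0.99}

# With equal priors the nearer center wins; with skewed priors the point
# (1.9, 0) is assigned to "b" even though it is slightly closer to "a".
near_a = classify_identity_cov([1.9, 0.0], centers, equal)
pulled_to_b = classify_identity_cov([1.9, 0.0], centers, skewed)
```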
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, we can take <math>\, U=V</math>. Here <math>\, \Sigma_k </math>, being a covariance matrix, has this property.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
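<br />
As a numerical sanity check of the derivation above, the following Python sketch builds a small covariance matrix from an assumed <math>\,U</math> and <math>\,S</math> and compares the two sides of the identity. The matrices and points are made-up values chosen for illustration.<br />

```python
import math

def mat_vec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Assumed example: orthonormal U (columns are eigenvectors), eigenvalues S.
r = 1.0 / math.sqrt(2.0)
U = [[r, r], [r, -r]]
S = [3.0, 1.0]
# Sigma = U S U^T, which works out to [[2, 1], [1, 2]].
Sigma = [[sum(U[i][k] * S[k] * U[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]

def whiten(x):
    """x* = S^(-1/2) U^T x."""
    y = mat_vec(transpose(U), x)
    return [yi / math.sqrt(si) for yi, si in zip(y, S)]

def mahalanobis_sq(x, mu):
    """(x - mu)^T Sigma^(-1) (x - mu), using the explicit 2x2 inverse."""
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return sum(di * ti for di, ti in zip(diff, mat_vec(inv, diff)))

def euclidean_sq(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

x, mu = [1.0, 3.0], [0.5, -1.0]
lhs = mahalanobis_sq(x, mu)
rhs = euclidean_sq(whiten(x), whiten(mu))
print(lhs, rhs)  # the two values agree up to rounding
```

After the transformation, the squared Euclidean distance in the new coordinates plays the role that the quadratic form with <math>\,\Sigma_k^{-1}</math> played in the original coordinates.<br />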
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider that you have two classes with different shapes, then consider transforming them to the same shape. Given a data point, decide which class this point belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because it does not assume that the covariance matrix is identical for every class. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
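<br />
To get a feel for how quickly the two counts diverge, here is a short Python sketch that evaluates the two formulas above. The function names are made up for this illustration.<br />

```python
def lda_param_count(K, d):
    """(K - 1) * (d + 1) parameters for K classes in d dimensions."""
    return (K - 1) * (d + 1)

def qda_param_count(K, d):
    """(K - 1) * (d * (d + 3) / 2 + 1) parameters; d * (d + 3) is always even."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

# Two classes, increasing dimension.
for d in (2, 10, 64):
    print(d, lda_param_count(2, d), qda_param_count(2, d))
```

For two classes in 64 dimensions this gives 65 parameters for LDA against 2145 for QDA.<br />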
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
{{Cleanup|date=October 2nd 2010|reason=It must be noted that we don't do QDA with LDA and if we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA }}<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (a symmetric <math>\,d \times d</math> covariance matrix has that many free entries) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with LDA, since LDA only produces linear functions of its input.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
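We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>, which is linear in <math>\,x^*</math> but quadratic in <math>\,x</math>. With only the squared features, this corresponds to a diagonal <math>\,v</math> with entries <math>\,v_1,...,v_d</math>; cross terms <math>\,x_ix_j</math> would require additional features.<br />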
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
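<br />
The augmentation can be sketched in Python as follows. The weights and the data point are hypothetical values; the sketch only checks that the linear function of <math>\,x^*</math> equals the quadratic function of <math>\,x</math> when only squared terms are used.<br />

```python
def augment(x):
    """Map x = [x_1, ..., x_d] to x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]

def linear_score(w_star, x_star):
    """Plain dot product w*^T x*."""
    return sum(wi * xi for wi, xi in zip(w_star, x_star))

# Hypothetical linear weights w and squared-term weights v_1, ..., v_d.
w = [1.0, -2.0]
v = [0.5, 3.0]
w_star = w + v            # w* = [w_1, ..., w_d, v_1, ..., v_d]

x = [2.0, -1.0]
augmented = linear_score(w_star, augment(x))
direct = (sum(vi * xi * xi for vi, xi in zip(v, x))
          + sum(wi * xi for wi, xi in zip(w, x)))
print(augmented, direct)  # both evaluate the same quadratic function of x
```

Replacing the squaring in <code>augment</code> with another function (a cube, a sine) gives the other extensions mentioned above.<br />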
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
{{Cleanup|date=September 2010|reason=Perhaps the first line should be plot (sample(1:200,1), sample(1:200,2), 'b.');?}} <br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learnt how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, followed by explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (possibly up to the sign of each column).<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. FDA is commonly used for dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Rough Description of FDA===<br />
Consider a simple case of FDA involving two classes of data in two dimensions. In theory, FDA attempts to collapse all the data points in each class onto one point on some projection line (one dimensional; a linear combination of components X1 and X2) while maximizing the distance between the two points. In effect, this creates a well defined separation between the two classes, allowing us to classify the data. In practice, it is generally not possible to collapse all data points in one class to a single point. We will instead make the data points within each class close to each other while simultaneously keeping them far from the other classes.<br />
<br />
===Example in R===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful feature extraction technique that is often used before classification. It has applications in data visualization, data mining, and reducing the dimensionality of a data set. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' principal components. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slenderness of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction of <math>\textbf{w}</math> is the same as that of <math>c\textbf{w}</math> for any <math>c > 0</math>, so without loss of generality we assume that <math>\textbf{w}</math> has unit length,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by any weight vector <math>\displaystyle w </math>. The first principal component is the vector <math>\displaystyle w </math> that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq | \textbf{w}^T s \textbf{w} | \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to find out which one is the maximum, we just need to substitute each of them into <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
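<br />
The result of this example can be checked numerically. The Python sketch below is only an illustration: it evaluates <math>\displaystyle f</math> at the two stationary points and compares the larger value with a brute-force search over the unit circle. The grid size is an arbitrary choice.<br />

```python
import math

def f(x, y):
    return x - y

# The two stationary points obtained from the Lagrange conditions.
h = math.sqrt(2) / 2
candidates = [(h, -h), (-h, h)]
best = max(candidates, key=lambda p: f(*p))

# Brute-force check: evaluate f on a fine grid of points on x^2 + y^2 = 1.
n = 100000
grid_max = max(f(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
               for i in range(n))

print(best, f(*best), grid_max)
```

Both approaches agree that the constrained maximum of <math>\displaystyle f</math> is <math>\displaystyle \sqrt{2}</math>.<br />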
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (<math> \textbf{w}^T \textbf{w} = 1 </math>), the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
The constraint itself is recovered from the stationarity condition with respect to <math>\displaystyle \lambda</math>: setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = -(\textbf{w}^T \textbf{w} - 1) = 0</math> gives <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
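<br />
A minimal Python sketch of this result follows. It uses power iteration as a stand-in for a full eigendecomposition (this is not the method used in the lecture), and the four data points are made up. Power iteration assumes that the starting vector is not orthogonal to the leading eigenvector and that the largest eigenvalue is strictly larger than the others.<br />

```python
import math

def sample_covariance(points):
    """Sample covariance matrix (divides by n - 1) of a list of d-dimensional points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    return [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_pc(S, iters=200):
    """Approximate the leading eigenvector and eigenvalue of S by power iteration."""
    d = len(S)
    w = [1.0] + [0.0] * (d - 1)          # arbitrary starting direction
    for _ in range(iters):
        Sw = [sum(S[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in Sw))
        w = [v / norm for v in Sw]       # keep w^T w = 1
    lam = sum(w[i] * S[i][j] * w[j] for i in range(d) for j in range(d))
    return w, lam

# Hypothetical two-dimensional data, elongated along the direction (1, 1).
data = [(2.0, 1.0), (-2.0, -1.0), (1.0, 2.0), (-1.0, -2.0)]
S = sample_covariance(data)
w, lam = first_pc(S)
explained = lam / (S[0][0] + S[1][1])    # share of the total variance Tr(S)
print(w, lam, explained)
```

Here <code>lam</code> is the variance of the data projected onto <code>w</code>, and <code>explained</code> is the fraction of the total variance captured by the first principal component.<br />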
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
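<br />
As a small worked check of this decomposition (the matrix below is a made-up illustration, not data from the lecture), suppose <math>\, D = 2</math> and<br />
<br />
<math>\displaystyle S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}</math><br />
<br />
The characteristic equation <math>\, (2-\lambda)^2 - 1 = 0</math> gives <math>\, \lambda_1 = 3</math> and <math>\, \lambda_2 = 1</math>, with unit eigenvectors <math>\, u_1 = \frac{1}{\sqrt{2}}(1, 1)^T</math> and <math>\, u_2 = \frac{1}{\sqrt{2}}(1, -1)^T</math>. Then <math>\, u_1^T S u_1 = 3</math> is the largest variance attainable by any unit-length direction, and <math>\, \lambda_1 + \lambda_2 = 4 = Tr(S) = Var(x_1) + Var(x_2)</math>. In this example the first PC accounts for <math>\, 3/4</math> of the total variance.<br />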
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy                                   % load the noisy face data
who                                          % list the variables that were loaded
size(X)                                      % each column of X is one image with 20*28 = 560 pixels
imagesc(reshape(X(:,1),20,28)')              % display the first noisy image
colormap gray
[u s v] = svd(X);                            % singular value decomposition of the data
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';    % use ten principal components
figure
imagesc(reshape(xHat(:,1000),20,28)')        % here '1000' can be changed to different values, e.g. 105, 500, etc.
colormap gray
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. use labeled data and assign each new data point to one of the known classes) or cluster the data (i.e. group unlabeled data points according to their similarity).<br />
<br />
Generally speaking, we can do this with the entire data set (for an 8 × 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to work with the reduced data and its features. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
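<br />
The usual PCA steps (centre the data, find the top eigenvector of the covariance matrix, encode, reconstruct) can be sketched for 2-D data reduced to 1-D as follows. This is only an illustrative sketch in Python rather than the Matlab used in class: the function name <code>pca_2d_to_1d</code> and the toy data are hypothetical, the covariance is divided by ''n'' (an assumption), and the closed-form eigenvalue formula applies only to symmetric 2 × 2 matrices.<br />

```python
import math

def pca_2d_to_1d(points):
    """Centre 2-D points, project them onto the top principal component,
    and reconstruct them from the 1-D codes."""
    n = len(points)
    # Step 1: subtract the mean so that the data have mean zero.
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centred = [(p[0] - mx, p[1] - my) for p in points]

    # Step 2: covariance matrix S = [[a, b], [b, c]] (dividing by n is an assumption).
    a = sum(x * x for x, _ in centred) / n
    b = sum(x * y for x, y in centred) / n
    c = sum(y * y for _, y in centred) / n

    # Step 3: largest eigenvalue of the symmetric 2x2 matrix, in closed form,
    # and a matching unit-length eigenvector u.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b != 0:
        u = (lam - c, b)
    else:
        u = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(u[0], u[1])
    u = (u[0] / norm, u[1] / norm)

    # Step 4: encode (y = u^T x) and reconstruct (x_hat = u y, plus the mean).
    codes = [u[0] * x + u[1] * y for x, y in centred]
    recon = [(u[0] * t + mx, u[1] * t + my) for t in codes]
    return codes, recon, lam

points = [(0, 0), (1, 2), (2, 4), (3, 6)]   # toy data lying exactly on a line
codes, recon, lam = pca_2d_to_1d(points)
```

Because the toy points lie exactly on a line, one principal component carries all of the variance and the reconstruction recovers the original points up to rounding error; with noisy data the discarded direction would hold the residual.<br />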
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6414stat841f102010-10-02T20:41:33Z<p>Hclam: /* Fisher's Discriminant Analysis */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification - 2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views. Some people call supervised learning "classification" and unsupervised learning "clustering", while others call clustering "unsupervised classification" (see, for example, http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups that can be found by a Google search). I think this should be made clear here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably by prehistoric peoples, who needed to recognize which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>-th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\,i</math>-th training input, <math>\,Y_{i} \in \mathcal{Y} </math>, takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=September 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,i</math>-th training input, respectively.<br />
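<br />
The empirical error rate is simply the fraction of training pairs that the rule gets wrong. A minimal Python sketch follows; the fruit data and the threshold rule <code>h</code> are hypothetical and serve only to illustrate the formula for <math>\,\hat{L}_{n}</math>.<br />

```python
def empirical_error_rate(h, inputs, labels):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(inputs) != len(labels) or not inputs:
        raise ValueError("inputs and labels must be non-empty and of equal length")
    # Each mismatch contributes I(h(X_i) != Y_i) = 1 to the sum.
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(inputs)

# Hypothetical rule: call a fruit an "orange" when its diameter exceeds 7 cm.
def h(diameter_cm):
    return "orange" if diameter_cm > 7.0 else "apple"

diameters = [6.1, 7.4, 8.0, 6.8, 7.2]
labels = ["apple", "orange", "orange", "apple", "apple"]
rate = empirical_error_rate(h, diameters, labels)   # one of five inputs is misclassified
```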
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics). A widely used special case is the [http://en.wikipedia.org/wiki/Naive_Bayes_classifier naive Bayes classifier], which additionally makes strong (naive) independence assumptions about the features. A more descriptive term for the probability model underlying naive Bayes would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that, within a class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for naive Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
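<br />
The Bayes formula above amounts to multiplying each class's likelihood by its prior and then normalising by the evidence. A minimal Python sketch, in which the function name <code>posterior</code> and the three-class numbers are hypothetical:<br />

```python
def posterior(likelihoods, priors):
    """Return P(Y=y | X=x) for every class y via the Bayes formula.

    likelihoods[y] = P(X=x | Y=y), evaluated at the observed x
    priors[y]      = P(Y=y)
    """
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(joint.values())            # P(X=x), the normalising constant
    if evidence == 0:
        raise ValueError("x has zero probability under every class")
    return {y: joint[y] / evidence for y in joint}

# Hypothetical likelihoods and priors for three classes.
post = posterior({"a": 0.10, "b": 0.30, "c": 0.20},
                 {"a": 0.50, "b": 0.25, "c": 0.25})
most_probable = max(post, key=post.get)       # class with the largest posterior
```

Since the evidence is the same for every class, it affects the posterior values but not which class has the largest one.<br />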
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability that a new data input having feature values <math>\,x</math> belongs to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the smallest possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the posterior probabilities to deviate quite a bit from their true population values as well. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=September 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
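<br />
Approach 3 can be sketched in Python for the simplest setting, a single discrete feature, where each class-conditional probability is estimated by a relative frequency. The function name <code>fit_plugin_classifier</code>, the colour data, and the fallback value for unseen inputs are all assumptions made for illustration; continuous features would need a different density estimate.<br />

```python
def fit_plugin_classifier(xs, ys):
    """Density-estimation approach for a discrete x and labels in {0, 1}."""
    n = len(ys)
    n1 = sum(ys)
    n0 = n - n1
    if n0 == 0 or n1 == 0:
        raise ValueError("need at least one training example from each class")
    prior1 = n1 / n                        # estimate of P(Y=1) = (1/n) * sum of Y_i

    def likelihood(x, label, count):
        # Relative frequency of x among the training inputs carrying this label.
        return sum(1 for xi, yi in zip(xs, ys) if yi == label and xi == x) / count

    def r_hat(x):
        num = likelihood(x, 1, n1) * prior1
        den = num + likelihood(x, 0, n0) * (1 - prior1)
        if den == 0:
            return 0.5                     # x never observed: arbitrary fallback
        return num / den

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Hypothetical training data: one colour feature, label 1 = apple, 0 = not apple.
xs = ["red", "red", "green", "red", "green", "green"]
ys = [1, 1, 1, 0, 0, 0]
r_hat, h = fit_plugin_classifier(xs, ys)
```

With these counts, two of the three class-1 inputs are red and one of the three class-0 inputs is red, so the estimated posterior for a red input is 2/3 and the rule assigns it to class 1.<br />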
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\sum_{i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\sum_{i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes, <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his/her feature values are <math>\, x = (G, M, H) </math> and his/her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he/she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
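<br />
The arithmetic of this example can be checked with a few lines of Python. The sketch assumes <math>\,P(X=(0,1,0)|Y=1) = 0.05</math> and <math>\,P(X=(0,1,0)|Y=0) = 0.2</math>, which are the values that reproduce the ratio <math>\,0.025/0.125</math> in the calculation above; the source table is an image, so these two likelihoods should be treated as an assumption. The function name <code>pass_probability</code> is hypothetical.<br />

```python
def pass_probability(lik_pass, lik_fail, prior_pass=0.5):
    """r(x) = P(Y=1 | X=x) for the two-class pass/fail problem."""
    num = lik_pass * prior_pass                        # P(X=x | Y=1) P(Y=1)
    return num / (num + lik_fail * (1 - prior_pass))   # divide by the evidence P(X=x)

# Assumed likelihoods for x = (0, 1, 0); see the lead-in.
r = pass_probability(lik_pass=0.05, lik_fail=0.2)
prediction = 1 if r > 0.5 else 0                       # 1 = pass, 0 = fail
```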
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable it is that event E will occur, prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events that are relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work that laid the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number, of well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on an event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
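<br />
The long-run relative frequency <math>\frac{n_x}{n_t}</math> can be illustrated by simulation. In the following Python sketch, the function name <code>relative_frequency</code>, the "intrinsic" probability of 0.3, and the trial counts are all hypothetical; the random number generator is seeded only so that a run can be repeated.<br />

```python
import random

def relative_frequency(p_true, n_trials, seed=0):
    """Estimate an event's probability as n_x / n_t over independent trials."""
    rng = random.Random(seed)               # seeded so that the run is repeatable
    # Each trial succeeds with probability p_true; n_x counts the successes.
    n_x = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return n_x / n_trials

# Estimates of a hypothetical intrinsic probability of 0.3 for growing n_t.
estimates = {n: relative_frequency(0.3, n) for n in (10, 1000, 100000)}
```

The estimate for a small number of trials can be far from 0.3, while the estimate for a large number of trials is typically much closer, which is the sense in which the finite-sample ratio only approximates the limit.<br />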
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] in the linear case) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\, \mu_y</math> is the mean of class <math>\, y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-classes case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
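<br />
The coefficients of this hyperplane can be computed directly from the class parameters. The following Matlab lines are a minimal sketch under stated assumptions: the means, covariance matrix, priors and test point are made-up values chosen only for illustration, and the variable names are not taken from the lecture.<br />
<br />
 % Hypothetical two-class example in d = 2 dimensions (all values are made up).<br />
 mu1 = [2; 1];            % mean of class 1<br />
 mu0 = [0; 0];            % mean of class 0<br />
 Sigma = [2 0.5; 0.5 1];  % common covariance matrix<br />
 pi1 = 0.5;  pi0 = 0.5;   % prior probabilities<br />
 % Coefficients of the boundary a'*x + b = 0 from the derivation above.<br />
 a = -2 * (Sigma \ (mu1 - mu0));<br />
 b = -2*log(pi1/pi0) + mu1'*(Sigma\mu1) - mu0'*(Sigma\mu0);<br />
 % a'*x + b equals -2 times (log posterior of class 1 minus log posterior of class 0),<br />
 % so a negative value means class 1 is the more probable class.<br />
 x = [1.5; 1];            % a new point to classify<br />
 if a'*x + b < 0<br />
     label = 1;<br />
 else<br />
     label = 0;<br />
 end<br />
<br />
The backslash operator solves the linear system instead of forming <math>\,\Sigma^{-1}</math> explicitly, which is usually the more stable choice numerically.<br />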
<br />
<br />
LDA under the two-classes case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes <math>\,m </math> and <math>\,n</math>. All we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. Doing so, one obtains the Bayes classifier's decision boundary between classes <math>\,m </math> and <math>\,n</math> to be <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. In the special case where <math>\,m </math> and <math>\,n</math> contain the same number of data points, so that the estimated priors <math>\,\pi_m</math> and <math>\,\pi_n</math> are equal, the resulting decision boundary passes through the point exactly halfway between the means of <math>\,m </math> and <math>\,n</math>.<br />
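<br />
As a quick check of the halfway claim, here is a short worked substitution. It assumes equal priors, <math>\,\pi_m=\pi_n</math>, and evaluates the left-hand side of the boundary equation at the midpoint <math>\,x=\frac{1}{2}(\mu_m+\mu_n)</math> of the two class means:<br />
<br />
:<math>\,-2x^\top\Sigma^{-1}(\mu_m-\mu_n) = -(\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n) = -\mu_m^\top\Sigma^{-1}\mu_m+\mu_n^\top\Sigma^{-1}\mu_n</math> (the cross terms cancel because <math>\,\Sigma^{-1}</math> is symmetric)<br />
:<math>\,\Rightarrow -2\log(1)+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n-\mu_m^\top\Sigma^{-1}\mu_m+\mu_n^\top\Sigma^{-1}\mu_n=0</math><br />
<br />
So the midpoint satisfies the boundary equation. With unequal priors the <math>\,\log</math> term is non-zero and the boundary shifts towards the mean of the less probable class.<br />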
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link]:<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, it nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic Discriminant Analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than Linear discriminant analysis. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariances of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
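<br />
To see where the linear form comes from, one can expand the quadratic term of <math>\,\delta_k</math> with <math>\,\Sigma_k=\Sigma</math>. This is only a sketch of the algebra, using the symbols of the theorem:<br />
<br />
:<math>\,-\frac{1}{2}(x-\mu_k)^\top\Sigma^{-1}(x-\mu_k) = -\frac{1}{2}x^\top\Sigma^{-1}x + x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k</math><br />
<br />
The terms <math>\,-\frac{1}{2}\log(|\Sigma|)</math> and <math>\,-\frac{1}{2}x^\top\Sigma^{-1}x</math> take the same value for every class, so they do not change which class attains the maximum and can be dropped, leaving the linear expression.<br />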
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=September 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi}_k = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu}_k = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma}_k = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^\top</math><br />
<br />
In the case where we assume a common covariance matrix, its Maximum Likelihood (ML) estimate is the average of the class sample covariances, weighted by class size:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{r=1}^{k}n_r} </math><br />
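<br />
The estimates above and the two discriminant functions of the theorem can be put together in a few lines of Matlab. This is a minimal sketch, assuming synthetic data and class labels numbered 1 to K; the data, the test point and all variable names are hypothetical and chosen only for illustration.<br />
<br />
 % Synthetic data (hypothetical): rows of X are observations, y holds labels 1..K.<br />
 X = [randn(50,2); randn(50,2) + 3];<br />
 y = [ones(50,1); 2*ones(50,1)];<br />
 [n, d] = size(X);<br />
 K = max(y);<br />
 prior = zeros(K,1);  mu = zeros(K,d);  S = zeros(d,d,K);<br />
 for k = 1:K<br />
     Xk = X(y == k, :);<br />
     nk = size(Xk, 1);<br />
     prior(k) = nk / n;                 % estimate of pi_k<br />
     mu(k,:) = mean(Xk, 1);             % estimate of mu_k<br />
     Xc = Xk - repmat(mu(k,:), nk, 1);  % centered class data<br />
     S(:,:,k) = (Xc' * Xc) / nk;        % ML estimate of Sigma_k<br />
 end<br />
 % Pooled (common) covariance: average of the S_k weighted by class size.<br />
 Spooled = zeros(d,d);<br />
 for k = 1:K<br />
     Spooled = Spooled + prior(k) * S(:,:,k);<br />
 end<br />
 % Evaluate both discriminant functions at a new point x (column vector).<br />
 x = [1; 1];<br />
 dq = zeros(K,1);  dl = zeros(K,1);<br />
 for k = 1:K<br />
     m = mu(k,:)';<br />
     dq(k) = -0.5*log(det(S(:,:,k))) - 0.5*(x-m)'*(S(:,:,k)\(x-m)) + log(prior(k));<br />
     dl(k) = x'*(Spooled\m) - 0.5*m'*(Spooled\m) + log(prior(k));<br />
 end<br />
 [dmax, qdaClass] = max(dq);   % class chosen by the quadratic rule<br />
 [dmax, ldaClass] = max(dl);   % class chosen by the linear rule<br />
<br />
Because <code>prior(k)</code> equals <math>\,n_k/n</math>, the weighted sum in the second loop is the same quantity as the pooled covariance formula above.<br />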
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data of each class is distributed symmetrically around its center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class whose center is closest after this adjustment maximises <math>\,\delta_k</math>. According to the theorem, we then assign the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is a covariance matrix, so it is both.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
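<br />
The transformation can be checked numerically. The following is a minimal Matlab sketch, assuming a made-up covariance matrix, mean and data point; it uses <code>eig</code> rather than <code>svd</code>, which gives an equivalent decomposition for a symmetric positive definite matrix.<br />
<br />
 % Hypothetical covariance matrix, class mean and data point (values are made up).<br />
 Sigma = [4 1; 1 2];<br />
 mu = [1; -1];<br />
 x = [2; 0.5];<br />
 % Sigma is symmetric, so eig returns orthonormal eigenvectors U and eigenvalues in S.<br />
 [U, S] = eig(Sigma);<br />
 W = diag(1 ./ sqrt(diag(S))) * U';   % W = S^(-1/2) * U'<br />
 xs = W * x;                          % transformed point x*<br />
 ms = W * mu;                         % transformed mean<br />
 % The two quantities below agree up to rounding error.<br />
 dMahal = (x - mu)' * (Sigma \ (x - mu));<br />
 dEucl = (xs - ms)' * (xs - ms);<br />
 disp(abs(dMahal - dEucl));<br />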
<br />
Note that when we have multiple classes, they must all have the same transformation; otherwise, we would have to assume ahead of time that a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Consider two classes with different shapes, each of which would need its own transformation. Given a data point, we must decide which class it belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because it does not make LDA's assumption that the covariance matrix of every class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not hold in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare the differences between one given class and the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
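<br />
As a rough worked example of these formulas, take <math>\,K=2</math> classes. The values of <math>\,d</math> below are chosen for illustration: <math>\,d=2</math> matches the PCA-reduced 2_3 data used further down this page, and <math>\,d=64</math> is its original dimensionality.<br />
<br />
:LDA: <math>\,(K-1)(d+1) = 3</math> for <math>\,d=2</math>, and <math>\,65</math> for <math>\,d=64</math>.<br />
:QDA: <math>\,(K-1)\left(\frac{d(d+3)}{2}+1\right) = 6</math> for <math>\,d=2</math>, and <math>\,\frac{64 \times 67}{2}+1 = 2145</math> for <math>\,d=64</math>.<br />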
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are non-linear functions of our original features, such as their squares. We then do LDA on our new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space then corresponds to a quadratic boundary in the original space.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra <math>\,\frac{d(d+1)}{2}</math> estimates make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
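where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. For this illustration, <math>\,v</math> is taken to be a diagonal matrix with diagonal entries <math>\,v_1,\dots,v_d</math>.<br />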
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the original entries squared, to a cubic dimension with the original entries cubed, or even use a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
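<br />
For instance, writing the new function out for <math>\,d=2</math> (a small worked case added here for illustration) gives<br />
<br />
:<math>\,\underline{w}^{*T}x^* = w_1x_1 + w_2x_2 + v_1x_1^2 + v_2x_2^2</math><br />
<br />
This is linear in the four coordinates of <math>\,x^*</math>, so LDA can estimate it, yet setting it equal to a constant describes a quadratic curve in the original <math>\,(x_1,x_2)</math> plane. Note that no cross term <math>\,x_1x_2</math> appears unless it is appended as a further feature.<br />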
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we enlarge <code>X_star</code> to six columns and set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
{{Cleanup|date=September 2010|reason=Perhaps the first line should be plot (sample(1:200,1), sample(1:200,2), 'b.');?}} <br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
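<br />
:The same comparison gives the error rate directly. The lines below are a minimal sketch on six made-up labels (the vectors <code>truth</code> and <code>predicted</code> are hypothetical stand-ins for <code>group</code> and <code>class</code>):<br />
<br />
 % Hypothetical true and predicted labels for six points.<br />
 truth = [1; 1; 1; 2; 2; 2];<br />
 predicted = [1; 2; 1; 2; 2; 1];<br />
 nCorrect = sum(predicted == truth);   % number of correct classifications<br />
 errRate = mean(predicted ~= truth);   % fraction misclassified, 1 - nCorrect/numel(truth)<br />
<br />
:Since the classifier is evaluated on the same points it was trained on, this figure is a training error and will tend to understate the error on new data.<br />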
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA correctly classifies only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learnt how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in SCORES,<br />
 % the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. FDA is commonly used for dimensionality reduction and classification. <br />
<br />
===Contrasting FDA with PCA===<br />
The two feature extraction techniques we will be studying in this class are FDA and Principal Component Analysis (PCA). These two differ in that<br />
* PCA maps data to lower dimensions to maximize the variation in those dimensions<br />
* FDA maps data to lower dimensions to best separate data in different classes<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
As noted in the diagram, FDA maximizes separation between different classes of data and is therefore a better feature extraction algorithm for classification.<br />
<br />
Note that FDA is a supervised algorithm, that is, we have prior knowledge of the associations between the data and the different classes, and we exploit that knowledge to find a good projection to lower dimensions. PCA is not a supervised algorithm.<br />
<br />
===Description of FDA===<br />
<br />
<br />
<br />
<br />
<br />
===Example in R===<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for dimensionality reduction, often used as a preprocessing step for classification. It has applications in data visualization, data mining, and reducing the dimensionality of a data set. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d'' - dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second Principal Component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k'' - dimensional space, which is easier to analyze and plot.<br />
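<br />
The rough definition above can be sketched in Matlab using only an eigendecomposition of the sample covariance matrix. This is a minimal illustration on synthetic data; the data, the choice ''k'' = 2 and all variable names are assumptions made for the example.<br />
<br />
 % Synthetic data (hypothetical): n = 200 observations in d = 3 dimensions, one per row.<br />
 n = 200;<br />
 X = randn(n, 3) * [3 0 0; 1 1 0; 0 0 0.2];<br />
 Xc = X - repmat(mean(X, 1), n, 1);            % center each column<br />
 C = (Xc' * Xc) / (n - 1);                     % sample covariance matrix<br />
 [V, D] = eig(C);<br />
 [latent, order] = sort(diag(D), 'descend');   % variances, largest first<br />
 V = V(:, order);                              % Principal Components as columns<br />
 k = 2;<br />
 score = Xc * V(:, 1:k);                       % data approximated in k dimensions<br />
 explained = sum(latent(1:k)) / sum(latent);   % fraction of variance kept<br />
<br />
The value of <code>explained</code> shows how much of the total variance survives the reduction to ''k'' dimensions.<br />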
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
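
The approximation <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math> can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <code>reconstruct</code> and the three-pixel toy vectors are hypothetical and are not taken from the threes data set.

```python
def reconstruct(mean, components, coefficients):
    """Return mean + sum_j coefficients[j] * components[j], element by element."""
    x_hat = list(mean)
    for lam, v in zip(coefficients, components):
        for i in range(len(x_hat)):
            x_hat[i] += lam * v[i]
    return x_hat

# Hypothetical toy values: a 3-pixel "image" with two principal directions.
x_bar = [1.0, 2.0, 3.0]   # plays the role of the mean image
v1 = [1.0, 0.0, 0.0]      # plays the role of the first column of V
v2 = [0.0, 1.0, 0.0]      # plays the role of the second column of V
x_hat = reconstruct(x_bar, [v1, v2], [2.0, -1.0])
print(x_hat)
```

For the real data the same function would be called with a mean vector of length 644 and the two columns of V, so each image is stored as just the pair of coefficients.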
<br />
<br />
To demonstrate this process, we can compare images of 2s and 3s. We apply PCA to the data and then compare the labeled images. This is an example of using PCA in classification.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle \textbf{w} </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
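
The result of this example can be checked numerically. The sketch below does not use Lagrange multipliers at all: it simply evaluates <math>\displaystyle f(x,y)=x-y</math> on a fine grid of points on the unit circle and keeps the best one. The function name <code>best_on_unit_circle</code> and the grid size are arbitrary choices made for this illustration.

```python
import math

def f(x, y):
    """Objective from the example: f(x, y) = x - y."""
    return x - y

def best_on_unit_circle(num_points=100000):
    """Grid-search f over points (cos t, sin t), which satisfy x^2 + y^2 = 1."""
    best = None
    for k in range(num_points):
        t = 2.0 * math.pi * k / num_points
        x, y = math.cos(t), math.sin(t)
        value = f(x, y)
        if best is None or value > best[0]:
            best = (value, x, y)
    return best

value, x, y = best_on_unit_circle()
print(value, x, y)
```

The search should land close to the stationary point <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> with a value close to <math>\displaystyle \sqrt{2}</math>, in agreement with the calculation by hand.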
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint, so any stationary point satisfies <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below; this uses the fact that '''S''' is symmetric. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
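
As a rough illustration of this result, the sketch below finds the first principal component of a small two-dimensional sample. It uses power iteration, a standard way to approximate the eigenvector with the largest eigenvalue; this is a swapped-in numerical technique and not the method used in the lecture. The data points and the function names are hypothetical.

```python
import math

def sample_covariance(points):
    """Sample covariance matrix (2 x 2) of a list of (x1, x2) points."""
    n = len(points)
    mean1 = sum(p[0] for p in points) / n
    mean2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - mean1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - mean2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - mean1) * (p[1] - mean2) for p in points) / (n - 1)
    return [[s11, s12], [s12, s22]]

def apply(S, w):
    """Matrix-vector product S w for a 2 x 2 matrix."""
    return [S[0][0] * w[0] + S[0][1] * w[1],
            S[1][0] * w[0] + S[1][1] * w[1]]

def first_principal_component(S, iterations=200):
    """Power iteration: repeatedly apply S and rescale to unit length."""
    w = [1.0, 0.0]  # arbitrary start; it must not be orthogonal to the answer
    for _ in range(iterations):
        v = apply(S, w)
        norm = math.sqrt(v[0] ** 2 + v[1] ** 2)
        w = [v[0] / norm, v[1] / norm]
    Sw = apply(S, w)
    variance = w[0] * Sw[0] + w[1] * Sw[1]  # w^T S w, the eigenvalue
    return w, variance

# Hypothetical toy sample in which the two variables are strongly correlated.
points = [(2.0, 1.9), (0.0, 0.2), (-1.0, -1.1), (1.0, 0.8), (-2.0, -1.8)]
S = sample_covariance(points)
w, lam = first_principal_component(S)
print(w, lam)
```

The returned <code>w</code> has unit length and satisfies <math>\displaystyle S\textbf{w} = \lambda\textbf{w}</math> up to rounding, and <code>lam</code> is the variance of the data projected onto <code>w</code>.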
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
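
As a small worked illustration of this decomposition (the matrix below is a hypothetical example, not taken from the lecture data), suppose <math>\, D = 2</math> and

<math>S = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}</math>

The eigenvalues of <math>\, S</math> are <math>\lambda_1 = 3</math> and <math>\lambda_2 = 1</math>, with unit eigenvectors <math>\textbf{w}_1 = \frac{1}{\sqrt{2}}[1, 1]^T</math> and <math>\textbf{w}_2 = \frac{1}{\sqrt{2}}[1, -1]^T</math>. Then

<math>Var(u_1) + Var(u_2) = 3 + 1 = 4 = Tr(S) = Var(x_1) + Var(x_2) = 2 + 2</math>

so in this example the first principal component alone accounts for <math>\, 3/4</math> of the total variance.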
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load noisy                                  % loads the data matrix X, one image per column<br />
 who                                         % list the variables now in the workspace<br />
 size(X)                                     % (number of pixels) by (number of images)<br />
 imagesc(reshape(X(:,1),20,28)')             % display the first noisy image<br />
 colormap gray<br />
 [u, s, v] = svd(X);                         % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';   % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)')       % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. leave the data unlabeled and let the groups emerge from the data themselves).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 × 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to work with the reduced data, i.e. a small number of features extracted from the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top ''d'' dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6413stat841f102010-10-02T20:18:26Z<p>Hclam: </p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification - 2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different viewes. Some people name the supervised learning, "classification" and the unsupervised learning,"clustering" . But the other ones are just calling the clustering, "unsupervised classification", like http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which could be found by a google search. Finally, I think It is good to be cleared here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information using small data sets where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,i</math>th training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector, and the label of the <math>\,i</math>th training input, <math>\,Y_{i} \in \mathcal{Y} </math>, takes one of a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=September 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
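
The empirical error rate is straightforward to compute. The following is a minimal sketch, assuming the classification rule is available as a Python function; the rule <code>threshold_rule</code> and the five training pairs are hypothetical and serve only to exercise the formula.

```python
def empirical_error_rate(h, inputs, labels):
    """Fraction of training pairs (X_i, Y_i) for which h(X_i) != Y_i."""
    if len(inputs) != len(labels) or not inputs:
        raise ValueError("inputs and labels must be non-empty and equally long")
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(inputs)

def threshold_rule(x):
    """Hypothetical rule: predict class 1 when the single feature exceeds 0.5."""
    return 1 if x > 0.5 else 0

# Hypothetical training set with one feature per input.
inputs = [0.1, 0.4, 0.6, 0.9, 0.3]
labels = [0, 1, 1, 1, 0]
rate = empirical_error_rate(threshold_rule, inputs, labels)
print(rate)
```

Here the rule misclassifies only the second input, so the empirical error rate is 1/5.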
<br />
=== Bayes Classifier ===<br />
<br />
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of naive Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that naive Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math>, which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> is the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> essentially combines together the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>. <br />
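
The two-class rule can be sketched as follows, assuming the two likelihood values at <math>\,x</math> and the prior <math>\,P(Y=1)</math> are already known or estimated. The function names and the numbers in the example calls are hypothetical.

```python
def posterior_class1(lik1, lik0, prior1):
    """r(x) = P(Y=1 | X=x), given P(X=x|Y=1), P(X=x|Y=0) and P(Y=1)."""
    evidence = lik1 * prior1 + lik0 * (1.0 - prior1)  # P(X=x)
    if evidence == 0.0:
        raise ValueError("x has zero probability under both classes")
    return lik1 * prior1 / evidence

def bayes_rule(lik1, lik0, prior1):
    """Assign class 1 when r(x) > 1/2 and class 0 otherwise."""
    return 1 if posterior_class1(lik1, lik0, prior1) > 0.5 else 0

# Hypothetical numbers: x is three times as likely under class 0 as class 1.
r = posterior_class1(0.2, 0.6, 0.5)
label = bayes_rule(0.2, 0.6, 0.5)
print(r, label)
```

With equal priors the comparison <math>\,r(x) > \frac{1}{2}</math> reduces to comparing the two likelihoods, which is why the input in this example is assigned to class 0.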
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it has the least possible probability of misclassifying a new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*) \le L(h)</math>. Here, <math>\,L</math> represents the true error rate and <math>\,h^*</math> is the Bayes classifier's classification rule. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, their estimated values in the trained model may deviate quite a bit from their true population values, and this ultimately can cause the posterior probabilities to deviate quite a bit from their true population values. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=September 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
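
Approach 3 can be sketched for a single discrete feature as follows. This is a minimal illustration, assuming the class-conditional probabilities are estimated by relative frequencies within each class; the function name <code>fit_plug_in</code> and the toy data are hypothetical. The sketch also assumes that both classes appear in the training data and that the new input has been seen in at least one class, since otherwise a division by zero occurs.

```python
def fit_plug_in(inputs, labels):
    """Estimate P(Y=1) and P(X=x|Y=c) from data; return (r_hat, h)."""
    n = len(labels)
    prior1 = sum(labels) / n  # estimate of P(Y=1) as the mean of the Y_i

    def likelihood(x, c):
        # Relative frequency of the value x among inputs with label c.
        in_class = [xi for xi, yi in zip(inputs, labels) if yi == c]
        return in_class.count(x) / len(in_class)

    def r_hat(x):
        # Estimated posterior P(Y=1 | X=x) from Bayes' formula.
        numerator = likelihood(x, 1) * prior1
        denominator = numerator + likelihood(x, 0) * (1.0 - prior1)
        return numerator / denominator

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h

# Hypothetical training data: one categorical feature taking values "a" or "b".
inputs = ["a", "a", "b", "b", "a", "b"]
labels = [1, 1, 1, 0, 0, 0]
r_hat, h = fit_plug_in(inputs, labels)
print(r_hat("a"), h("a"), r_hat("b"), h("b"))
```

In this toy data set the value "a" occurs in two of the three class 1 inputs and one of the three class 0 inputs, so the estimated posterior for "a" is 2/3 and the rule assigns it to class 1.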
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be reworded as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
<br />
'''Example:'''<br />
We are going to predict whether a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, his or her feature values are <math>\, x = (G, M, H) </math> and his or her class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
<br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
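<br />
The calculation above can be reproduced with a few lines of plain Python (standard library only, rather than the Matlab used elsewhere in these notes). This is only a sketch: the function names <code>posterior</code> and <code>bayes_classify</code> are made up for illustration, and the likelihood values 0.05 (pass) and 0.2 (fail) are assumptions chosen to be consistent with the <math>\,\frac{0.025}{0.125}</math> step, since the tables themselves are not reproduced in the text.<br />
<br />
```python
def posterior(likelihoods, priors):
    """Return P(Y=i | X=x) for every class i using Bayes' formula."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # denominator: sum over all classes
    return {c: joint[c] / evidence for c in joint}


def bayes_classify(likelihoods, priors):
    """Return the class with the largest posterior (the arg max rule)."""
    post = posterior(likelihoods, priors)
    return max(post, key=post.get)


# Assumed values of P(X=(0,1,0) | Y=y); 1 = pass, 0 = fail.
likelihoods = {1: 0.05, 0: 0.2}
priors = {1: 0.5, 0: 0.5}

post = posterior(likelihoods, priors)        # post[1] is r_hat(x)
label = bayes_classify(likelihoods, priors)  # predicted class
print(post[1], label)
```
<br />
Because the rule is written as an arg max over classes, the same two functions apply unchanged when there are more than two classes.<br />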
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how believable it is that E will occur before anything is known about other events whose occurrence could affect E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability of E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the more useful information is available about events relevant to event E, the more accurate the estimate of the probability of event E's occurrence becomes. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest work laying the framework for the Bayesian view of probability is credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number of, well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] in the linear case) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions about the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math> and <math>\,d</math> is the number of features. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
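<br />
As an illustration, the coefficients ''a'' and ''b'' can be computed numerically. The following is a minimal sketch in plain Python for two-dimensional data; the helper names (<code>inv2</code>, <code>lda_boundary</code>, <code>lda_predict</code>) and the numerical values of the means, covariance and priors are made up for this example. Tracing the signs through the derivation above, the left-hand side of the boundary equation equals <math>\,-2\log\frac{f_1(x)\pi_1}{f_0(x)\pi_0}</math>, so a negative value corresponds to class 1.<br />
<br />
```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def lda_boundary(mu0, mu1, sigma, pi0, pi1):
    """Return (a, b) such that the decision boundary is a^T x + b = 0."""
    s_inv = inv2(sigma)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    a = [-2.0 * t for t in matvec(s_inv, diff)]
    b = (-2.0 * math.log(pi1 / pi0)
         + dot(mu1, matvec(s_inv, mu1))
         - dot(mu0, matvec(s_inv, mu0)))
    return a, b


def lda_predict(x, a, b):
    """Class 1 if a^T x + b < 0, otherwise class 0."""
    return 1 if dot(a, x) + b < 0 else 0


# Hypothetical parameters, chosen only for illustration.
mu0, mu1 = [0.0, 0.0], [2.0, 2.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
a, b = lda_boundary(mu0, mu1, sigma, 0.5, 0.5)
print(a, b, lda_predict([1.8, 1.5], a, b))
```
<br />
In higher dimensions the 2x2 helpers would be replaced by a general linear solver, but the formulas for ''a'' and ''b'' stay the same.<br />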
<br />
Not surprisingly, the Bayes classifier's decision boundary being linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-class case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. To find the Bayes classifier's decision boundary between any two classes <math>\,m </math> and <math>\,n</math>, we follow the derivation shown above with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. This gives the decision boundary between classes <math>\,m </math> and <math>\,n</math> as <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. In the special case where <math>\,m </math> and <math>\,n</math> have the same number of data points, so that their estimated priors are equal and the <math>\,\log(\frac{\pi_m}{\pi_n})</math> term vanishes, the resulting decision boundary passes through the point exactly halfway between the class means <math>\,\mu_m</math> and <math>\,\mu_n</math>.<br />
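<br />
The "halfway" remark can be checked directly. The following short derivation is a sketch that assumes equal priors, <math>\,\pi_m=\pi_n</math> (which is what equal class sizes give when the priors are estimated by class proportions), and uses the symmetry of <math>\,\Sigma^{-1}</math>. Substituting the midpoint <math>\,x=\frac{1}{2}(\mu_m+\mu_n)</math> into the left-hand side of the boundary equation gives<br />
<br />
:<math>\begin{align} & -2\log(1)+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n-(\mu_m+\mu_n)^\top\Sigma^{-1}(\mu_m-\mu_n) \\ &= \mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n-\left(\mu_m^\top\Sigma^{-1}\mu_m-\mu_m^\top\Sigma^{-1}\mu_n+\mu_n^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n\right) \\ &= 0, \end{align}</math><br />
<br />
since the two cross terms cancel. The midpoint of the two class means therefore lies on the decision boundary. With unequal priors, the log term shifts the boundary toward the mean of the less probable class.<br />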
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that calculating the class posteriors <math>\Pr(Y=k|X=x)</math> gives us optimal classification. He also shows that, by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \ \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic discriminant analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features that best separates the classes. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance matrices of the classes are identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the Gaussians share a common covariance matrix <math>\,\Sigma</math>, then after dropping the terms that do not depend on <math>\,k</math> this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class index <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
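<br />
To make the theorem concrete, here is a minimal sketch in plain Python that evaluates both discriminant functions for two-dimensional data and classifies a point by taking the arg max. The function names and the example parameters are hypothetical and chosen only for illustration. When all classes share one covariance matrix, the quadratic and linear discriminants differ only by terms that are the same for every class, so they pick the same class.<br />
<br />
```python
import math


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = det2(m)
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def delta_quadratic(x, mu, sigma, prior):
    """-1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k."""
    diff = [x[0] - mu[0], x[1] - mu[1]]
    return (-0.5 * math.log(det2(sigma))
            - 0.5 * dot(diff, matvec(inv2(sigma), diff))
            + math.log(prior))


def delta_linear(x, mu, sigma, prior):
    """x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    s_inv_mu = matvec(inv2(sigma), mu)
    return dot(x, s_inv_mu) - 0.5 * dot(mu, s_inv_mu) + math.log(prior)


def classify(x, params, delta):
    """Return the class k whose discriminant delta_k(x) is largest."""
    return max(params, key=lambda k: delta(x, *params[k]))


# Hypothetical parameters: class -> (mean, covariance, prior).
sigma = [[2.0, 0.5], [0.5, 1.0]]
params = {1: ([0.0, 0.0], sigma, 0.5), 2: ([2.0, 2.0], sigma, 0.5)}
x = [1.8, 1.5]
label_qda = classify(x, params, delta_quadratic)
label_lda = classify(x, params, delta_linear)
print(label_qda, label_lda)
```
<br />
Giving each class its own covariance matrix in <code>params</code> turns the first call into genuine QDA, while <code>delta_linear</code> is only meaningful under the common-covariance assumption.<br />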
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we assume a common covariance matrix, its maximum likelihood (ML) estimate is the average of the class sample covariances, weighted by the class sizes:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma_r}}{\sum_{l=1}^{k}n_l} </math><br />
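<br />
The estimates above can be sketched in plain Python as follows. The function name <code>estimate_parameters</code> and the small data set are made up for illustration; the code simply applies the formulas for the class proportions, class means, class covariances (dividing by <math>\,n_k</math>) and the size-weighted common covariance.<br />
<br />
```python
def estimate_parameters(xs, ys):
    """Return (priors, means, covs, pooled) estimated from labeled data.

    xs is a list of feature vectors, ys the list of class labels.
    """
    n, d = len(xs), len(xs[0])
    priors, means, covs = {}, {}, {}
    pooled = [[0.0] * d for _ in range(d)]
    for k in sorted(set(ys)):
        pts = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(pts)
        priors[k] = n_k / n
        mu = [sum(p[j] for p in pts) / n_k for j in range(d)]
        means[k] = mu
        cov = [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in pts) / n_k
                for j in range(d)] for i in range(d)]
        covs[k] = cov
        # Common covariance: class covariances weighted by n_k / n.
        for i in range(d):
            for j in range(d):
                pooled[i][j] += n_k * cov[i][j] / n
    return priors, means, covs, pooled


# Hypothetical toy data: four points in class 1, two in class 2.
xs = [[0, 0], [2, 0], [0, 2], [2, 2], [4, 4], [6, 4]]
ys = [1, 1, 1, 1, 2, 2]
priors, means, covs, pooled = estimate_parameters(xs, ys)
print(priors, means, pooled)
```
<br />
For QDA one would keep the per-class matrices in <code>covs</code>; for LDA only <code>pooled</code> is needed.<br />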
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data of each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the squared distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class whose center is closest after this adjustment, i.e. the class that minimises <math>\,\frac{1}{2}(x-\mu_k)^\top(x-\mu_k) - \log(\pi_k)</math>, will maximise <math>\,\delta_k</math>. According to the theorem, we can then assign the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
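<br />
A small sketch in plain Python shows how the log-prior adjustment can change the decision in this case. The function names and the numbers are hypothetical; the point chosen is slightly closer to the second center, so it is assigned to class 2 under equal priors but to class 1 when class 1 is much more probable a priori.<br />
<br />
```python
import math


def delta_identity(x, mu, prior):
    """Discriminant for Sigma_k = I: -1/2 ||x - mu_k||^2 + log pi_k."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * sq_dist + math.log(prior)


def classify_identity(x, means, priors):
    """Return the class with the largest discriminant value."""
    return max(means, key=lambda k: delta_identity(x, means[k], priors[k]))


# Hypothetical centers and test point.
means = {1: [0.0, 0.0], 2: [2.0, 0.0]}
x = [1.1, 0.0]

label_equal = classify_identity(x, means, {1: 0.5, 2: 0.5})
label_skewed = classify_identity(x, means, {1: 0.9, 2: 0.1})
print(label_equal, label_skewed)
```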
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, we can take <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is a covariance matrix, so it is symmetric and positive semi-definite.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthogonal, i.e. <math>\,U^{-1}=U^\top</math>)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
which is the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
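<br />
The identity derived above can be checked numerically. The sketch below, in plain Python, builds a covariance matrix from an assumed orthogonal matrix <math>\,U</math> (a 45-degree rotation) and assumed eigenvalues 4 and 1, then compares the quadratic form <math>\,(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)</math> with the squared Euclidean distance after the transformation <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>. All names and numbers are illustrative only.<br />
<br />
```python
import math

# Assumed example: U is a rotation by 45 degrees (an orthogonal matrix whose
# columns play the role of eigenvectors); S holds the eigenvalues.
theta = math.pi / 4
c, s = math.cos(theta), math.sin(theta)
U = [[c, -s], [s, c]]
S = [4.0, 1.0]

# Sigma = U S U^T, built entry by entry.
sigma = [[sum(U[i][k] * S[k] * U[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]


def whiten(x):
    """Return x* = S^{-1/2} U^T x."""
    ut_x = [U[0][k] * x[0] + U[1][k] * x[1] for k in range(2)]
    return [ut_x[k] / math.sqrt(S[k]) for k in range(2)]


def mahalanobis_sq(x, mu, sig):
    """(x - mu)^T Sigma^{-1} (x - mu) for a 2x2 Sigma."""
    det = sig[0][0] * sig[1][1] - sig[0][1] * sig[1][0]
    inv = [[sig[1][1] / det, -sig[0][1] / det],
           [-sig[1][0] / det, sig[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))


def euclidean_sq(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


x, mu = [3.0, 1.0], [1.0, 2.0]
lhs = mahalanobis_sq(x, mu, sigma)            # quadratic form with Sigma^{-1}
rhs = euclidean_sq(whiten(x), whiten(mu))     # plain distance after transform
print(lhs, rhs)
```
<br />
In practice <math>\,U</math> and <math>\,S</math> would come from an eigendecomposition or SVD of the estimated covariance matrix rather than being chosen by hand.<br />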
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape (the same covariance matrix) for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we want to transform them to the same shape. Given a data point, we must decide which class it belongs to, but which transformation should we apply to it? If we use the transformation of class A, for example, then we have already assumed that the data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because QDA does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class-conditional distribution is Gaussian, which may not be the case in practice. Another method, kernel QDA, does not make the Gaussian assumption and can work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
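<br />
The two counts are easy to tabulate. The sketch below, in plain Python with made-up function names, evaluates both formulas; for two classes and the 64-dimensional 2_3 data used in these notes, LDA needs 65 parameters while QDA needs 2145.<br />
<br />
```python
def lda_param_count(K, d):
    """(K-1) linear boundaries, each with d + 1 parameters."""
    return (K - 1) * (d + 1)


def qda_param_count(K, d):
    """(K-1) quadratic boundaries, each with d(d+3)/2 + 1 parameters."""
    # d * (d + 3) is always even, so integer division is exact.
    return (K - 1) * (d * (d + 3) // 2 + 1)


for d in (2, 10, 64):
    print(d, lda_param_count(2, d), qda_param_count(2, d))
```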
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions, such as squares, of our original features. We then do LDA on our new higher-dimensional data. The linear boundary found by LDA in the higher-dimensional space can then be expressed in terms of the original features, giving us a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the number of free entries in a symmetric <math>\,d \times d</math> matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
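where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \in\ \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. Here we take <math>v</math> to be a diagonal matrix with diagonal entries <math>v_1,\dots,v_d</math>, so that <math>x^Tvx = \sum_{i=1}^{d} v_i x_i^2</math> and there are no cross terms.<br />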
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the entries of the original matrix squared, to a cubic dimension with the entries cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
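<br />
The following plain Python sketch illustrates the augmentation. The function names and the numbers are hypothetical, and the quadratic part is assumed to have no cross terms (coefficients <math>v_1,\dots,v_d</math> only). It checks that a linear function of the augmented vector <math>x^*</math> reproduces the quadratic function of the original <math>x</math>.<br />
<br />
```python
def augment_with_squares(x):
    """Return x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return list(x) + [xi ** 2 for xi in x]


def linear_score(w_star, x_star):
    """Linear function w*^T x* in the augmented space."""
    return sum(wi * xi for wi, xi in zip(w_star, x_star))


def quadratic_score(w, v, x):
    """g(x) = sum_i v_i x_i^2 + w^T x in the original space."""
    return (sum(vi * xi ** 2 for vi, xi in zip(v, x))
            + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical coefficients and data point.
w = [1.0, -2.0]
v = [0.5, 3.0]
x = [2.0, -1.0]

x_star = augment_with_squares(x)   # [2.0, -1.0, 4.0, 1.0]
w_star = w + v                     # [w_1, w_2, v_1, v_2]
print(linear_score(w_star, x_star), quadratic_score(w, v, x))
```
<br />
Any method that fits a linear function of <math>x^*</math>, such as LDA, therefore implicitly fits a quadratic function of <math>x</math>.<br />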
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we make <code>X_star</code> a 400-by-6 matrix and set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
 >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA; we can see a blue point and a red point that lie on the correct side of the curve but do not lie on the correct side of the line.]]<br />
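<br />
:The two runs can be compared through their empirical error rates. This short Python sketch uses only the counts reported above (369 and 371 correct out of 400); the helper name <code>empirical_error_rate</code> is our own.<br />

```python
def empirical_error_rate(n_correct, n_points):
    """Fraction of points whose predicted label differs from the true label."""
    return (n_points - n_correct) / n_points

n_points = 400
lda_error = empirical_error_rate(369, n_points)  # count from the LDA run
qda_error = empirical_error_rate(371, n_points)  # count from the QDA run
print(lda_error, qda_error)
```

:On this data set QDA lowers the training error only slightly, which matches the two extra points noted in the figure caption.<br />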
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently; its help file describes the function in detail. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method used in the assignment. The following is the code of <code>princomp</code>, with explanations of some important steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we therefore take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U (possibly up to the sign of each column, since singular vectors are only determined up to sign).<br />
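<br />
To see why princomp takes <math>\,V</math> as the coefficients, here is a short derivation. It assumes <math>\,X_c</math> denotes the centered data matrix with <math>\,m</math> rows (observations) and that <math>\,X_c = UdV'</math> is its SVD; the symbols <math>\,X_c</math> and <math>\,S</math> are introduced here for illustration.<br />
<br />
<math>\begin{align}
S &= \frac{1}{m-1} X_c' X_c = \frac{1}{m-1} V d' U' U d V' = V \left( \frac{d'd}{m-1} \right) V' \\
X_c V &= U d V' V = U d
\end{align}</math><br />
<br />
So the columns of <math>\,V</math> are eigenvectors of the sample covariance matrix <math>\,S</math> with eigenvalues <math>\,d_{ii}^2/(m-1)</math>, and the scores are <math>\,X_c V = Ud</math>. This is consistent with the code above, which takes the SVD of <code>centerx./sqrt(m-1)</code> and squares the singular values to obtain <code>latent</code>.<br />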
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
<br />
==Fisher's Discriminant Analysis== <br />
[http://en.wikipedia.org/wiki/Fisher_discriminant_analysis Fisher's Discriminant Analysis] (FDA) is a feature extraction technique that separates two or more classes of objects. FDA is commonly used for dimensionality reduction and classification. <br />
<br />
<br />
<br />
===Description of FDA===<br />
<br />
<br />
===Contrasting FDA with PCA===<br />
<br />
<br />
===Example in R===<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for summarizing high-dimensional data, and it is often used as a preprocessing step for classification. It has applications in data visualization, data mining, reducing the dimensionality of a data set, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data(Variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d'' - dimensional vectors and produces an orthogonal(zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
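<br />
As a small illustration of this definition, the following Python sketch computes the principal components of two-dimensional data in closed form, using the eigenvalues of the 2 by 2 sample covariance matrix. The function name <code>pca_2d</code> and the toy data are hypothetical; this is a sketch for ''d'' = 2 only, not a general PCA routine.<br />

```python
import math

def pca_2d(points):
    """Return ((lam1, lam2), first principal direction) for 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix (dividing by n - 1).
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector belonging to lam1; handle the already-diagonal case.
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Toy data lying exactly on the line y = x.
toy_points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
(lam1, lam2), direction = pca_2d(toy_points)
print(lam1, lam2, direction)
```

For this toy data all of the variance lies along the diagonal, so the second eigenvalue is zero and one principal component (''k'' = 1) already represents the data without loss.<br />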
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
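<br />
The approximation <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math> can be sketched in a few lines of Python. The example assumes that the columns <math>v_1, v_2</math> are orthonormal, so that each coefficient is the inner product of the centered data point with the corresponding column; the function names and the three-dimensional toy vectors are hypothetical and stand in for the 644-dimensional images.<br />

```python
def encode(mean, components, x):
    """Coefficients lambda_i = v_i^T (x - mean), assuming orthonormal v_i."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(vi * ci for vi, ci in zip(v, centered)) for v in components]

def reconstruct(mean, components, coeffs):
    """Approximation mean + sum_i lambda_i * v_i."""
    return [
        m + sum(c * v[i] for c, v in zip(coeffs, components))
        for i, m in enumerate(mean)
    ]

# Hypothetical toy values (3 "pixels" instead of 644).
mean_demo = [1.0, 1.0, 1.0]
v1 = [1.0, 0.0, 0.0]
v2 = [0.0, 1.0, 0.0]
x_demo = [3.0, 0.0, 1.5]

coeffs = encode(mean_demo, [v1, v2], x_demo)
x_hat = reconstruct(mean_demo, [v1, v2], coeffs)
print(coeffs, x_hat)
```

The third coordinate of <code>x_demo</code> is not captured by the two components, so it falls back to the mean in <code>x_hat</code>; this is the information lost by keeping only two coefficients.<br />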
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> determined by the weight vector <math>\displaystyle \textbf{w} </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq | \textbf{w}^T s \textbf{w} | \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded on the unit sphere, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
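<br />
The result of this example can be checked numerically. The Python sketch below evaluates <math>\displaystyle f</math> at the two stationary points and compares the larger value with a brute-force search over the unit circle; the variable names and the grid size are our own choices.<br />

```python
import math

def f(x, y):
    return x - y

# Setting the partial derivatives of L to zero gives x = 1/(2*lam) and
# y = -1/(2*lam); the constraint x^2 + y^2 = 1 then forces lam = +-sqrt(2)/2.
candidates = []
for lam in (math.sqrt(2) / 2, -math.sqrt(2) / 2):
    x = 1 / (2 * lam)
    y = -1 / (2 * lam)
    candidates.append((x, y, f(x, y)))

best = max(candidates, key=lambda c: c[2])

# Brute-force check: evaluate f on a fine grid of points on the unit circle.
n_grid = 100000
grid_max = max(
    f(math.cos(2 * math.pi * i / n_grid), math.sin(2 * math.pi * i / n_grid))
    for i in range(n_grid)
)
print(best, grid_max)
```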
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
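If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w}</math>. (Here <math>S</math> is the sample covariance matrix, written <math>s</math> above.)<br />
<br />
The constraint is enforced by the stationarity condition in <math>\lambda</math>: setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = -(\textbf{w}^T \textbf{w} - 1) = 0</math> gives <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of <math>L</math> satisfies the constraint.<br />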
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
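<br />
Both facts (the top eigenvector maximizes <math>\textbf{w}^T S \textbf{w}</math> over unit vectors, and the eigenvalues sum to the trace) can be checked on a small example. The following Python sketch uses power iteration, which is one standard way to find the leading eigenvector and is not necessarily the method used in class; the matrix <code>S_demo</code> and all function names are hypothetical.<br />

```python
import math

def mat_vec(S, w):
    """Matrix-vector product S w for a list-of-rows matrix."""
    return [sum(s_ij * w_j for s_ij, w_j in zip(row, w)) for row in S]

def quadratic_form(S, w):
    """The variance w^T S w of the projection onto w."""
    return sum(w_i * sw_i for w_i, sw_i in zip(w, mat_vec(S, w)))

def power_iteration(S, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector."""
    w = [1.0] * len(S)
    for _ in range(iterations):
        w = mat_vec(S, w)
        norm = math.sqrt(sum(c * c for c in w))
        w = [c / norm for c in w]
    return quadratic_form(S, w), w

# Hypothetical 2x2 covariance matrix with eigenvalues (7 +- sqrt(5)) / 2.
S_demo = [[4.0, 1.0], [1.0, 3.0]]
lam1, w1 = power_iteration(S_demo)
lam2 = (S_demo[0][0] + S_demo[1][1]) - lam1  # trace minus lam1 (2x2 case)

# No sampled unit vector should give a larger variance than lam1.
unit_vectors = [
    (math.cos(i * math.pi / 180), math.sin(i * math.pi / 180))
    for i in range(180)
]
largest_sampled = max(quadratic_form(S_demo, list(w)) for w in unit_vectors)
print(lam1, lam2, largest_sampled)
```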
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noise data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
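<br />
The reconstruction used in the code above can be written out explicitly. As a sketch, assume <math>X = U S V^T</math> is the SVD with singular values <math>\sigma_1 \geq \sigma_2 \geq \ldots</math>, with <math>u_i</math> and <math>v_i</math> the columns of <math>U</math> and <math>V</math>, and let <math>k</math> be the number of components kept (the symbol <math>k</math> is introduced here; <math>k = 10</math> in the code).<br />
<br />
<math>\begin{align}
\hat{X}_k &= \sum_{i=1}^{k} \sigma_i u_i v_i^T \\
\| X - \hat{X}_k \|_F^2 &= \sum_{i > k} \sigma_i^2
\end{align}</math><br />
<br />
By the Eckart-Young theorem, <math>\hat{X}_k</math> is the best rank-<math>k</math> approximation of <math>X</math> in the Frobenius norm. The discarded singular values measure exactly how much of the data is thrown away; when they mostly carry noise, dropping them de-noises the image, and when <math>k</math> grows to include all components the noisy image is recovered.<br />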
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean(see details[http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia].)<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>, where <math>U</math> is the <math>D \times d</math> matrix of the top ''d'' principal components. <br />
::#When we reconstruct the training set, we are only using the top ''d'' dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f10&diff=6412stat841f102010-10-02T19:33:15Z<p>Hclam: /* Bayes Classifier */</p>
<hr />
<div>==[[statf10841Scribe|Editor sign up]] ==<br />
== ''' Classification-2010.09.21''' ==<br />
<br />
=== Classification ===<br />
{{Cleanup|date= 29 September 2010|reason= Each topic should begin with a short summary as a separated section. Please write a digest for each topic}}<br />
<br />
<br />
<br />
{{Cleanup|date= 28 September 2010|reason= I think there are different views. Some people call supervised learning "classification" and unsupervised learning "clustering", but others call clustering "unsupervised classification", as in http://www.yale.edu/ceo/Projects/swap/landcover/Unsupervised_classification.htm and other groups which can be found by a Google search. I think this should be made clear here. }}<br />
'''Statistical classification''', or simply known as classification, is an area of [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A [http://en.wikipedia.org/wiki/Classifier_%28mathematics%29 classifier] is a specific technique or method for performing classification.<br />
To classify new data, a classifier first uses labeled (classes are known) [http://en.wikipedia.org/wiki/Training_set training data] to [http://en.wikipedia.org/wiki/Mathematical_model#Training train] a model, and then it uses a function known as its [http://en.wikipedia.org/wiki/Decision_rule classification rule] to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.<br />
<br />
Classification has been an important task for people and society since the beginnings of history. According to [http://www.schools.utah.gov/curr/science/sciber00/7th/classify/sciber/history.htm this link], the earliest application of classification in human society was probably done by prehistoric peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being [http://en.wikipedia.org/wiki/Regression_analysis regression], [http://en.wikipedia.org/wiki/Cluster_analysis clustering], and [http://en.wikipedia.org/wiki/Dimensionality_reduction dimensionality reduction] (feature extraction or manifold learning). <br />
<br />
In '''classical statistics''', classification techniques were developed to learn useful information from small data sets, where there is usually not enough data. When [http://en.wikipedia.org/wiki/Machine_learning machine learning] was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets, where there is usually too much data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers.<br />
<br />
''"We are drowning in information and starving for knowledge."'' <br />
- Rutherford D. Rogers<br />
<br />
In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.<br />
<br />
The formal mathematical definition of classification is as follows:<br />
<br />
'''Definition''': Classification is the prediction of a discrete [http://en.wikipedia.org/wiki/Random_variable random variable] <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math>, where <math> \mathcal{Y} </math> represents the label assigned to a new data input and <math> \mathcal{X} </math> represents the known feature values of the input. <br />
<br />
A set of training data used by a classifier to train its model consists of <math>\,n</math> [http://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables independently and identically distributed (i.i.d)] ordered pairs <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, where the feature values of the <math>\,ith</math> training input, <math>\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}</math>, form a ''d''-dimensional vector and the label of the <math>\, ith</math> training input is <math>\,Y_{i} \in \mathcal{Y} </math>, which takes a finite number of values. The classification rule used by a classifier has the form <math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math>. After the model is trained, each new data input whose feature values are <math>\,x</math> is given the label <math>\,\hat{Y}=h(x)</math>.<br />
<br />
As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.<br />
<br />
[[File:Data1.jpg]]<br />
<br />
After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule <math>\ h </math> to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.<br />
<br />
[[File:Data3.jpg]]<br />
<br />
As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data <math>\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}</math>, we could then use the classifier's classification rule <math>\,h</math> to assign any newly-given fruit having known feature values <math>\,x = (\,x_{color}, x_{diameter} , x_{weight})</math> the label <math>\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}</math>.<br />
<br />
=== Error rate ===<br />
{{Cleanup|date=September 2010|reason=It is important to notice why do we use empirical error rate instead of true error rate and why do we define it. The main reason is that in experimental tasks we can't measure true error rate and we estimate it by empirical error rate which is unbiased estimation of true error rate }}<br />
The '''true error rate''' <math>\,L(h)</math> of a classifier having classification rule <math>\,h</math> is defined as the probability that <math>\,h</math> misclassifies a new data input, i.e., it is defined as <math>\,L(h)=P(h(X) \neq Y)</math>. Here, <math>\,X \in \mathcal{X}</math> and <math>\,Y \in \mathcal{Y}</math> are the known feature values and the true class of that input, respectively. <br />
<br />
The '''empirical error rate''' (or '''training error rate''') of a classifier having classification rule <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the data inputs in the training set, i.e., it is defined as<br />
<math>\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator variable and <math>\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.</math>. Here, <br />
<math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y}</math> are the known feature values and the true class of the <math>\,ith</math> training input, respectively.<br />
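<br />
The definition of the empirical error rate translates directly into code. The Python sketch below counts the training inputs for which <math>\,h(X_i) \neq Y_i</math> and divides by <math>\,n</math>; the threshold rule <code>h_demo</code> and the toy training data (one diameter-like feature per fruit) are hypothetical.<br />

```python
def empirical_error_rate(h, inputs, labels):
    """(1/n) * sum of the indicator I(h(X_i) != Y_i) over the training set."""
    mistakes = sum(1 for x, y in zip(inputs, labels) if h(x) != y)
    return mistakes / len(labels)

def h_demo(x):
    """Hypothetical classification rule: threshold on a single feature."""
    return "orange" if x > 7.0 else "apple"

# Hypothetical training set.
train_x = [6.0, 6.5, 7.5, 8.0, 6.8]
train_y = ["apple", "apple", "orange", "orange", "orange"]

rate = empirical_error_rate(h_demo, train_x, train_y)
print(rate)
```

Only the last input (6.8, labeled orange) is misclassified by this rule, so one of the five indicators equals 1.<br />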
<br />
=== Bayes Classifier ===<br />
<br />
A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong [http://en.wikipedia.org/wiki/Naive_Bayes_classifier (naive)] independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".<br />
<br />
In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.<br />
<br />
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a [http://en.wikipedia.org/wiki/Supervised_learning supervised learning] setting. In many practical applications, parameter estimation for Bayes models uses the method of [http://en.wikipedia.org/wiki/Maximum_likelihood maximum likelihood]; in other words, one can work with the naive Bayes model without believing in [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian probability] or using any Bayesian methods.<br />
<br />
In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are some theoretical reasons for the apparently unreasonable [http://en.wikipedia.org/wiki/Efficacy efficacy] of Bayes classifiers.[1] Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as [http://en.wikipedia.org/wiki/Boosted_trees boosted trees] or [http://en.wikipedia.org/wiki/Random_forests random forests].[2]<br />
<br />
An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire [http://en.wikipedia.org/wiki/Covariance_matrix covariance matrix].<br />
<br />
After training its model using training data, the '''Bayes classifier''' classifies any new data input in two steps. First, it uses the input's known feature values and the [http://en.wikipedia.org/wiki/Bayes_formula Bayes formula] to calculate the input's [http://en.wikipedia.org/wiki/Posterior_probability posterior probability] of belonging to each class. Then, it uses its classification rule to place the input into its most-probable class, which is the one associated with the input's largest posterior probability. <br />
<br />
In mathematical terms, for a new data input having feature values <math>\,(X = x)\in \mathcal{X}</math>, the Bayes classifier labels the input as <math>(Y = y) \in \mathcal{Y}</math>, such that the input's posterior probability <math>\,P(Y = y|X = x)</math> is maximum over all of the members of <math>\mathcal{Y}</math>.<br />
<br />
Suppose there are <math>\,k</math> classes and we are given a new data input having feature values <math>\,x</math>. The following derivation shows how the Bayes classifier finds the input's posterior probability <math>\,P(Y = y|X = x)</math> of belonging to each class in <math>\mathcal{Y}</math>. <br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
Here, <math>\,P(Y=y|X=x)</math> is known as the posterior probability as mentioned above, <math>\,P(Y=y)</math> is known as the prior probability, <math>\,P(X=x|Y=y)</math> is known as the likelihood, and <math>\,P(X=x)</math> is known as the evidence.<br />
<br />
In the special case where there are two classes, i.e., <math>\, \mathcal{Y}=\{0, 1\}</math>, the Bayes classifier makes use of the function <math>\,r(x)=P\{Y=1|X=x\}</math> which is the posterior probability of a new data input having feature values <math>\,x</math> belonging to the class <math>\,Y = 1</math>. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates <math>\,r(x)</math> as follows: <br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
The Bayes classifier's classification rule <math>\,h^*: \mathcal{X} \mapsto \mathcal{Y}</math>, then, is <br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
<br />
Here, <math>\,x</math> denotes the feature values of a new data input and <math>\hat r(x)</math> is the estimated value of the function <math>\,r(x)</math> given by the Bayes classifier's model after feeding <math>\,x</math> into the model. Still in this special case of two classes, the Bayes classifier's [http://en.wikipedia.org/wiki/Decision_boundary decision boundary] is defined as the set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>. The decision boundary <math>\,D(h)</math> combines the trained model and the decision function <math>\,h</math>, and it is used by the Bayes classifier to assign any new data input to a label of either <math>\,Y = 0</math> or <math>\,Y = 1</math> depending on which side of the decision boundary the input lies on. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 &\text{if } P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>. <br />
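<br />
As a minimal sketch of the two-class rule, the following Python code computes <math>\,r(x)</math> from the two likelihoods and the prior and then applies the threshold of <math>\,\frac{1}{2}</math>. The function names and the numerical values are hypothetical and chosen only for illustration.<br />

```python
def posterior_class1(lik1, lik0, prior1):
    """r(x) = P(Y=1 | X=x) from the two likelihoods and the prior P(Y=1)."""
    prior0 = 1.0 - prior1
    evidence = lik1 * prior1 + lik0 * prior0  # P(X = x)
    if evidence == 0:
        raise ValueError("x has zero probability under both classes")
    return lik1 * prior1 / evidence


def bayes_rule(r_hat):
    """Assign class 1 when the (estimated) posterior exceeds 1/2, else class 0."""
    return 1 if r_hat > 0.5 else 0


# Hypothetical values: P(X=x|Y=1) = 0.3, P(X=x|Y=0) = 0.1, P(Y=1) = 0.4.
r_hat = posterior_class1(lik1=0.3, lik0=0.1, prior1=0.4)
label = bayes_rule(r_hat)
print(r_hat, label)
```

Thresholding <math>\,r(x)</math> at <math>\,\frac{1}{2}</math> is equivalent to comparing the two posteriors, because they sum to one in the two-class case.<br />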
<br />
'''Bayes Classification Rule Optimality Theorem''' <br />
:The Bayes classifier is the optimal classifier in that it produces the least possible probability of misclassification for any given new data input, i.e., for any other classifier having classification rule <math>\,h</math>, it is always true that <math>\,L(h^*(x)) \le L(h(x))</math>. Here, <math>\,L</math> represents the true error rate, <math>\,h^*</math> is the Bayes classifier's classification rule, and <math>\,x</math> is any given data input's feature values. <br />
<br />
Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason is that the components of the Bayes classifier's model, such as the likelihoods and prior probabilities, must either be estimated from training data or be guessed with a certain degree of belief. As a result, their values in the trained model may deviate considerably from the true population values, which in turn can cause the estimated posterior probabilities to deviate considerably from their true values. A rather detailed proof of the optimality theorem is available [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf here]. <br />
{{Cleanup|date=September 2010|reason=It must be noted if Bayes classifier is optimal why do we need to other classifiers like neural networks. The main reason is that while this method is optimal and simple it needs a lot of information if we want to implement it. We need to have the class conditional distribution, which is hard to have it and needs estimation. Basically in real applications we assume it to be a known distribution based on the problem but it is not easy in general to have the distribution of the process. We also need to know the prior and marginal probabilities, while they are more simple to get but add complexity to our problem. For this reason other classifiers have been developed. While they are not optimal but can be used more easily and need less information about the process. }}<br />
<br />
'''Defining the classification rule:'''<br />
<br />
In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule <math>\,h</math>:<br />
<br />
:1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of the true error rate <math>\,L(h)</math>.<br />
<br />
:2) Regression: Find an estimate <math> \hat r </math> of the function <math> \,r(x) = P(Y=1|X=x) </math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
:3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 0</math>, estimate <math>\,P(X=x|Y=1)</math> from the <math>\,X_{i}</math>'s for which <math>\,Y_{i} = 1</math>, and estimate <math>\,P(Y = 1)</math> as <math>\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}</math>. Then, calculate <math>\,\hat r(x) = \hat P(Y=1|X=x)</math> and define <br />
:<math>\, h(x)= \left\{\begin{matrix} <br />
1 &\text{if } \hat r(x)>\frac{1}{2} \\ <br />
0 &\text{if } \mathrm{otherwise} \end{matrix}\right.</math>.<br />
<br />
Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two. <br />
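<br />
Approach 3 can be sketched in Python for a single real-valued feature. The sketch assumes that each class-conditional density is Gaussian, which is one possible choice of density estimator and not the only one; the function names and the toy data are hypothetical.<br />

```python
import math
from statistics import mean, pstdev


def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_density_rule(xs, ys):
    """Estimate both class densities and the prior, then threshold r_hat at 1/2."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    prior1 = sum(ys) / len(ys)          # estimate of P(Y=1): (1/n) * sum of Y_i
    mu0, sd0 = mean(x0), pstdev(x0)     # assumed Gaussian fit for class 0
    mu1, sd1 = mean(x1), pstdev(x1)     # assumed Gaussian fit for class 1

    def r_hat(x):
        joint1 = normal_pdf(x, mu1, sd1) * prior1
        joint0 = normal_pdf(x, mu0, sd0) * (1.0 - prior1)
        return joint1 / (joint1 + joint0)

    def h(x):
        return 1 if r_hat(x) > 0.5 else 0

    return r_hat, h


# Hypothetical one-dimensional training data.
xs = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 1, 1]
r_hat, h = fit_density_rule(xs, ys)
print(h(1.2), h(4.8))
```

Any other density estimate (for example a histogram or a kernel density estimate) could replace <code>normal_pdf</code> without changing the rest of the procedure.<br />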
<br />
'''Multi-class classification:'''<br />
<br />
Suppose there are <math>\,k</math> classes, where <math>\,k \ge 2</math>.<br />
<br />
In the above discussion, we introduced the ''Bayes formula'' for this general case:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)}<br />
\end{align}<br />
</math><br />
<br />
which can be rewritten as:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i}<br />
\end{align}<br />
</math><br />
Here, <math>\,f_y(x) = P(X=x|Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function] and <math>\,\pi_y = P(Y=y)</math> is known as the [http://en.wikipedia.org/wiki/Prior_probability prior probability]. <br />
<br />
In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values <math>\,x</math> into one of the <math>\,k</math> classes.<br />
<br />
''Theorem''<br />
: Suppose that <math> \mathcal{Y}= \{1, \dots, k\}</math>, where <math>\,k \ge 2</math>. Then, the optimal classification rule is <math>\,h^*(x) = \arg\max_{i} P(Y=i|X=x)</math>, where <math>\,i \in \{1, \dots, k\}</math>. <br />
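<br />
The multi-class rule can be sketched in Python as follows. The likelihood values <math>\,f_i(x)</math> and priors <math>\,\pi_i</math> below are hypothetical numbers used only to illustrate the computation.<br />

```python
def posteriors(likelihoods, priors):
    """P(Y=i | X=x) for i = 1..k, given f_i(x) and pi_i in class order."""
    joint = [f * p for f, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(X = x), the denominator of the Bayes formula
    return [j / evidence for j in joint]


def bayes_classify(likelihoods, priors):
    """Return the class label in {1, ..., k} with the largest posterior."""
    post = posteriors(likelihoods, priors)
    best_index = max(range(len(post)), key=post.__getitem__)
    return best_index + 1  # classes are numbered from 1


# Hypothetical f_i(x) and pi_i for k = 3 classes.
f_x = [0.2, 0.5, 0.1]
pi = [0.5, 0.3, 0.2]
post = posteriors(f_x, pi)
label = bayes_classify(f_x, pi)
print(post, label)
```

Since the evidence is the same for every class, the same label is obtained by maximizing <math>\,f_i(x)\pi_i</math> directly.<br />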
<br />
'''Example:'''<br />
We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by <math>\, \mathcal{Y} = \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Suppose that the prior probabilities are estimated or guessed to be <math>\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5</math>. We have data on past student performances, which we shall use to train the model. For each student, we know the following:<br />
:Whether or not the student’s GPA was greater than 3.0 (G).<br />
:Whether or not the student had a strong math background (M).<br />
:Whether or not the student was a hard worker (H).<br />
:Whether or not the student passed or failed the course.<br />
<br />
These known data are summarized in the following tables:<br />
<br />
{{Cleanup|date=September 2010|reason=These tables should be checked over to make sure that they are correct for the numerical calculation that follows}} <br />
<br />
:[[File:裁剪.jpg]]<br />
<br />
For each student, the feature values are <math>\, x = (G, M, H) </math> and the class is <math>\, y \in \{0, 1\} </math>.<br />
<br />
Suppose there is a new student having feature values <math>\, x = (0, 1, 0)</math>, and we would like to predict whether he or she would pass the course. <math>\,\hat r(x)</math> is found as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The following calculation needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculation is correct and I write it more detailed}}<br />
{{Cleanup|date=October 2010|reason=I disagree, shouldn't the numbers be 0.05*0.5/(0.2*0.5+0.05*0.5) = 0.025/0.075 = 1/3?}}<br />
<br />
<br /><br />
<math>\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05 \times 0.5}{0.2 \times 0.5+0.05 \times 0.5}=\frac{0.025}{0.125}=0.2<\frac{1}{2}.</math><br /><br />
<br />
The Bayes classifier assigns the new student into the class <math>\, h^*(x)=0 </math>. Therefore, we predict that the new student would fail the course.<br />
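<br />
The arithmetic of this example can be reproduced with a short Python sketch. The likelihood values 0.05 and 0.2 are assumptions: they are the values consistent with the ratio <math>\,\frac{0.025}{0.125}</math> in the calculation above, since the tables themselves are only available as an image.<br />

```python
# Assumed likelihoods of observing x = (0, 1, 0) in each class; these are
# inferred from the ratio 0.025 / 0.125 and are not read from the tables.
lik_pass = 0.05   # P(X=(0,1,0) | Y=1)
lik_fail = 0.20   # P(X=(0,1,0) | Y=0)
prior_pass = 0.5  # P(Y=1)
prior_fail = 0.5  # P(Y=0)

numerator = lik_pass * prior_pass
evidence = lik_fail * prior_fail + lik_pass * prior_pass
r_hat = numerator / evidence

# Classification rule: predict "pass" (1) only if r_hat exceeds 1/2.
prediction = 1 if r_hat > 0.5 else 0
print(r_hat, prediction)
```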
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
The [http://en.wikipedia.org/wiki/Bayesian_probability Bayesian] view of probability and the [http://en.wikipedia.org/wiki/Frequency_probability frequentist] view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event. <br />
<br />
The Bayesian view of probability states that any event E has a '''prior probability''' that represents how strongly one believes that event E will occur before knowing anything about any other event whose occurrence could affect event E's occurrence. Theoretically, this prior probability is a ''belief'' that represents the baseline probability of event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability, of event E's occurrence can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the estimate of the probability of event E's occurrence becomes more accurate as more useful information regarding events relevant to event E becomes available. The Bayesian view therefore holds that there is no ''intrinsic'' probability of occurrence associated with any event. If one adheres to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, <math>\,50\%</math> for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that laid the framework for the Bayesian view of probability are credited to [http://en.wikipedia.org/wiki/Thomas_Bayes Thomas Bayes] (1702–1761).<br />
<br />
In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an ''intrinsic'' probability of occurrence associated with every event for which one can carry out many, if not an infinite number of, well-defined [http://en.wikipedia.org/wiki/Independence_%28probability_theory%29 independent] [http://en.wikipedia.org/wiki/Random random] [http://en.wikipedia.org/wiki/Experiments trials]. In each trial for an event, the event either occurs or it does not occur. Suppose <br />
<math>n_x</math> denotes the number of times that an event occurs during its trials and <math>n_t</math> denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the ''long run'', where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., <math>P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}</math>. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as <math>P(x) \approx \frac{n_x}{n_t}</math>. If one adheres to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow, and this is because one cannot possibly carry out trials on any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher [http://en.wikipedia.org/wiki/Aristotle Aristotle]. In his work [http://en.wikipedia.org/wiki/Rhetoric_%28Aristotle%29 ''Rhetoric''], Aristotle gave the famous line "'''''the probable is that which for the most part happens'''''".<br />
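<br />
The long-run relative frequency <math>\,\frac{n_x}{n_t}</math> can be illustrated with a small simulation. This is only a sketch: the event probability of 0.3, the trial counts, and the random seed are arbitrary assumptions made for the illustration.<br />

```python
import random


def relative_frequency(p_true, n_trials, seed=0):
    """Simulate n_trials independent trials and return n_x / n_t."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    occurrences = sum(1 for _ in range(n_trials) if rng.random() < p_true)
    return occurrences / n_trials


# Hypothetical event with intrinsic probability 0.3.
estimates = {n: relative_frequency(0.3, n) for n in (10, 1000, 100000)}
for n_t, estimate in estimates.items():
    print(n_t, estimate)
```

As the number of trials grows, the relative frequency tends to settle near the intrinsic probability, which is the frequentist interpretation of <math>\,P(x)</math>.<br />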
<br />
More information regarding the Bayesian and the frequentist schools of thought is available [http://www.statisticalengineering.com/frequentists_and_bayesians.htm here]. Furthermore, an interesting and informative YouTube video that explains the Bayesian and frequentist views of probability is available [http://www.youtube.com/watch?v=hLKOKdAircA here].<br />
<br />
== '''Linear and Quadratic Discriminant Analysis''' ==<br />
First, we shall limit ourselves to the case where there are two classes, i.e. <math>\, \mathcal{Y}=\{0, 1\}</math>. In the above discussion, we introduced the Bayes classifier's ''decision boundary'' <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math>, which represents a separating surface (a [http://en.wikipedia.org/wiki/Hyperplane hyperplane] in the linear case) that determines the class of any new data input depending on which side of the decision boundary the input lies on. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear discriminant analysis (LDA)] and [http://en.wikipedia.org/wiki/Quadratic_classifier#Quadratic_discriminant_analysis quadratic discriminant analysis (QDA)] are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.<br />
<br />
Let us denote the likelihood <math>\ P(X=x|Y=y) </math> as <math>\ f_y(x) </math> and the prior probability <math>\ P(Y=y) </math> as <math>\ \pi_y </math>.<br />
<br />
First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that the data from each of the two classes are generated from a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal (Gaussian) distribution] and that the two classes have the same covariance matrix <math>\, \Sigma</math>. Under the assumptions of LDA, we have: <math>\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma^{-1} (x - \mu_y) \right)</math>, where <math>\,\mu_y</math> is the mean of class <math>\,y</math>. Now, to derive the Bayes classifier's decision boundary using LDA, we equate <math>\, P(Y=1|X=x) </math> and <math>\, P(Y=0|X=x) </math> and proceed from there. The derivation of the decision boundary is as follows:<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The derivation of the Bayes classifier's decision boundary in the two-classes case needs to be checked over since I am not sure if it is entirely correct}} <br />
{{Cleanup|date=September 2010|reason=It seems that the calculations are correct. Please note which part do you think so that needs to be checked or rewritten}}<br />
<br />
<br />
:<math>\,Pr(Y=1|X=x)=Pr(Y=0|X=x)</math><br />
:<math>\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}</math> (using Bayes' Theorem)<br />
:<math>\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)</math> (canceling the denominators)<br />
:<math>\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0</math><br />
:<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math><br />
:<math>\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0</math> <br />
:<math>\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)</math> (taking the log of both sides).<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0</math> (expanding out)<br />
<br />
:<math>\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1}<br />
\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0</math> (canceling out like terms and factoring).<br />
<br />
:<math>\,\Rightarrow -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> (multiplying both sides by -2)<br />
<br />
<math>\, -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0)=0</math> is the Bayes classifier's decision boundary in the two-class case. This decision boundary is linear in <math>\ x</math>, i.e., it is a hyperplane of the form <math>\,a^\top x+b=0</math> where ''a'' is a constant vector and ''b'' is a constant scalar. Here, ''a'' <math>\, = - 2\Sigma^{-1}(\mu_1-\mu_0)</math> and ''b'' <math>\, = -2\log(\frac{\pi_1}{\pi_0})+\mu_1^\top\Sigma^{-1}\mu_1-\mu_0^\top\Sigma^{-1}\mu_0</math>.<br />
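<br />
The coefficients ''a'' and ''b'' of the boundary derived above can be computed numerically. The following is a minimal NumPy sketch; the means, covariance matrix, and priors are hypothetical values chosen only for illustration.<br />

```python
import numpy as np


def lda_boundary(mu1, mu0, sigma, pi1, pi0):
    """Return (a, b) such that the LDA decision boundary is a^T x + b = 0."""
    sigma_inv = np.linalg.inv(sigma)
    a = -2.0 * sigma_inv @ (mu1 - mu0)
    b = (-2.0 * np.log(pi1 / pi0)
         + mu1 @ sigma_inv @ mu1
         - mu0 @ sigma_inv @ mu0)
    return a, b


# Hypothetical class means, shared covariance matrix, and equal priors.
mu1 = np.array([2.0, 1.0])
mu0 = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
a, b = lda_boundary(mu1, mu0, sigma, pi1=0.5, pi0=0.5)

# With equal priors, the midpoint of the two means lies on the boundary.
midpoint = (mu1 + mu0) / 2.0
value_at_midpoint = a @ midpoint + b

# The expression equals -2 times the log posterior ratio of class 1 to
# class 0, so negative values correspond to class 1.
value_at_mu1 = a @ mu1 + b
print(a, b, value_at_midpoint, value_at_mu1)
```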
<br />
Not surprisingly, the fact that the Bayes classifier's decision boundary is linear in <math>\ x</math> under the assumptions of LDA is where the word ''linear'' in linear discriminant analysis comes from.<br />
<br />
<br />
LDA under the two-class case can easily be generalized to the general case where there are <math>\,k \ge 2</math> classes. In the general case, to find the Bayes classifier's decision boundary between two classes <math>\,m </math> and <math>\,n</math>, all we need to do is follow a derivation very similar to the one shown above, with the classes <math>\,1 </math> and <math>\,0</math> replaced by the classes <math>\,m </math> and <math>\,n</math>. This yields the decision boundary between classes <math>\,m </math> and <math>\,n</math>: <math>\, -2\log(\frac{\pi_m}{\pi_n})+\mu_m^\top\Sigma^{-1}\mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n)=0</math>. If classes <math>\,m </math> and <math>\,n</math> have equal prior probabilities (for example, because both have the same number of training data points), then the log term vanishes and the resulting decision boundary passes through the midpoint between the class means <math>\,\mu_m </math> and <math>\,\mu_n</math>.<br />
<br />
<br />
The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in [http://www.outguess.org/detection.php this link].<br />
<br />
<br />
Although the assumptions of LDA may not hold true for most real-world data, LDA nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000] is essentially a weighted LDA. This model has demonstrated an 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.<br />
<br />
<br />
{{Cleanup|date=September 2010|reason=The second and third limitations should be checked for their validity}} <br />
<br />
<br />
Some of the limitations of LDA include:<br />
<br />
* LDA implicitly assumes that each class has a Gaussian distribution.<br />
* LDA implicitly assumes that the mean rather than the variance is the discriminating factor.<br />
* LDA may over-fit the training data.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - 2010.09.23''' ==<br />
<br />
===Lecture Summary ===<br />
<br />
In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors <math>\Pr(Y=k|X=x)</math> we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix <math>\Sigma_{k}=\Sigma \ \forall k </math>, the decision boundary between classes <math>k</math> and <math>l</math> is linear (LDA). However, if we do not assume the same covariance matrix for the two classes, the decision boundary is a quadratic function (QDA). <br />
<br />
The following [http://www.mathworks.com/help/toolbox/stats/classify.html MATLAB examples] can be used to demonstrate LDA and QDA.<br />
<br />
===LDA x QDA===<br />
<br />
Linear discriminant analysis[http://en.wikipedia.org/wiki/Linear_discriminant_analysis] is a statistical method used to find the ''linear combination'' of features which best separates two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. <br />
<br />
Quadratic discriminant analysis[http://en.wikipedia.org/wiki/Quadratic_classifier], on the other hand, aims to find the ''quadratic combination'' of features. It is more general than linear discriminant analysis: unlike LDA, QDA does not assume that the covariance matrices of the classes are identical.<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier's rule is<br />
:<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance matrices of the Gaussians are all the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
{{Cleanup|date=September 2010|reason=Not very clear what is meant by the set of k. Perhaps it should be the class k?}} <br />
{{Cleanup|date=September 2010|reason=It seems that the author meant index k which is related to one of the classes or briefly Kth class. }}<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> (the class index) for which <math>\,\delta_k(x)</math> attains its largest value.<br />
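<br />
Both discriminant functions of the theorem can be sketched with NumPy as follows. The function names and the example parameters are hypothetical, and the parameters are treated as known here rather than estimated.<br />

```python
import numpy as np


def qda_discriminant(x, mu, sigma, prior):
    """Quadratic delta_k(x) for a class with mean mu and covariance sigma."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)       # log |Sigma_k|
    maha = diff @ np.linalg.solve(sigma, diff)  # (x-mu)^T Sigma_k^{-1} (x-mu)
    return -0.5 * logdet - 0.5 * maha + np.log(prior)


def lda_discriminant(x, mu, sigma, prior):
    """Linear delta_k(x); sigma is the covariance shared by all classes."""
    sigma_inv_mu = np.linalg.solve(sigma, mu)   # Sigma^{-1} mu_k
    return x @ sigma_inv_mu - 0.5 * mu @ sigma_inv_mu + np.log(prior)


def classify(x, mus, sigmas, priors, discriminant):
    """Return the class index in {1, ..., k} that maximizes delta_k(x)."""
    scores = [discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores)) + 1


# Hypothetical two-class problem with a shared covariance matrix.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
priors = [0.5, 0.5]
x = np.array([2.5, 2.0])

label_qda = classify(x, mus, [sigma, sigma], priors, qda_discriminant)
label_lda = classify(x, mus, [sigma, sigma], priors, lda_discriminant)
print(label_qda, label_lda)
```

When every class is given the same covariance matrix, the quadratic and the linear discriminants differ only by terms that do not depend on the class, so both select the same label.<br />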
<br />
===In practice===<br />
In practice, the true values of the parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in their place, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
When the classes are assumed to share a common covariance matrix, it is estimated by the weighted average of the class sample covariances, which is the maximum likelihood (ML) estimate:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}(n_r\hat{\Sigma_r})}{\sum_{l=1}^{k}(n_l)} </math><br />
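<br />
The sample estimates above, including the pooled covariance, can be computed as in the following NumPy sketch. The function name and the toy data set are hypothetical.<br />

```python
import numpy as np


def estimate_parameters(X, y):
    """ML estimates of the priors, class means, class covariances, and pooled covariance."""
    n, d = X.shape
    priors, means, covs = {}, {}, {}
    pooled = np.zeros((d, d))
    for k in np.unique(y):
        k = int(k)
        Xk = X[y == k]                            # rows belonging to class k
        n_k = len(Xk)
        priors[k] = n_k / n                       # pi_k = n_k / n
        means[k] = Xk.mean(axis=0)                # mu_k
        centered = Xk - means[k]
        covs[k] = centered.T @ centered / n_k     # Sigma_k (divides by n_k)
        pooled += n_k * covs[k]
    pooled /= n                                   # weighted average of the Sigma_k
    return priors, means, covs, pooled


# Hypothetical training set with two features and two classes.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])
priors, means, covs, pooled = estimate_parameters(X, y)
print(priors, means, pooled)
```

Note that these estimates divide by <math>\,n_k</math> rather than <math>\,n_k - 1</math>, which matches the maximum likelihood formulas above rather than the unbiased sample covariance.<br />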
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
[[File:case1.jpg|300px|thumb|right]] <br />
<br />
This means that the data of each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}\log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + \log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero since <math>\ |I| </math> is the determinant of the identity matrix and <math>\ |I|=1 </math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the [http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Euclidean_and_Euclidean_Squared_Distance_Metrics.htm squared Euclidean distance] between <math>\,x</math> and <math>\,\mu_k</math>. Therefore, for each class we can compute half the squared distance between the point and the class center and adjust it by subtracting the log of the prior, <math>\,\log(\pi_k)</math>. The class with the smallest adjusted distance maximises <math>\,\delta_k</math>. According to the theorem, we then assign the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric and positive semi-definite, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is a covariance matrix, so it is symmetric and positive semi-definite.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
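<br />
The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be checked numerically. The following NumPy sketch uses the eigendecomposition routine <code>numpy.linalg.eigh</code> in place of a general SVD, which is an assumption that is valid because a covariance matrix is symmetric; the covariance matrix and the points are hypothetical.<br />

```python
import numpy as np


def whitening_matrix(sigma):
    """Return W = S^{-1/2} U^T, where sigma = U S U^T."""
    # eigh handles symmetric matrices and returns the eigenvalues s together
    # with orthonormal eigenvectors stored in the columns of U.
    s, U = np.linalg.eigh(sigma)
    return np.diag(s ** -0.5) @ U.T


# Hypothetical positive definite covariance matrix, class mean, and data point.
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

W = whitening_matrix(sigma)
x_star, mu_star = W @ x, W @ mu

# The quadratic form in the original space ...
mahalanobis_sq = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)
# ... equals the squared Euclidean distance in the transformed space.
euclidean_sq = (x_star - mu_star) @ (x_star - mu_star)
print(mahalanobis_sq, euclidean_sq)
```

In the transformed space the covariance matrix becomes the identity, which is exactly the setting of Case 1.<br />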
<br />
Note that when we have multiple classes, the same transformation must be applied for all of them; otherwise, we would have to assume ahead of time that a data point belongs to one particular class. All classes therefore need to have the same shape (i.e., the same covariance matrix) for this method to be applicable, so the method works for LDA.<br />
<br />
If the classes have different shapes, in other words, different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose you have two classes with different shapes and you transform each of them to the same shape. Given a new data point, you must decide which class it belongs to, but which transformation can you use? If you use the transformation of class A, for example, then you have already assumed that this data point belongs to class A.<br />
<br />
[http://portal.acm.org/citation.cfm?id=1340851 Kernel QDA]<br />
In practice, QDA often fits the data better than LDA because it does not make LDA's assumption that the covariance matrix of every class is identical. However, QDA still assumes that each class-conditional distribution is Gaussian, which may not hold in practice. Kernel QDA is another method that does not rely on the Gaussian assumption and can therefore work better in such cases.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== Trick: Using LDA to do QDA ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are nonlinear functions of the original features, such as their squares. We then do LDA on the new higher-dimensional data. The linear boundary that LDA finds in the higher-dimensional space corresponds to a quadratic boundary in the original space.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
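where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x \in \mathbb{R}^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method. Here <math>\,v</math> is a <math>\,d \times d</math> matrix, which we take to be diagonal with entries <math>\,v_1,\ldots,v_d</math>, so that only the squared terms appear.<br />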
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> data matrix with quadratic features by appending another <math>D \times n</math> matrix containing the element-wise squares of the original matrix, with cubic features by appending the element-wise cubes, or even with a different function altogether, such as <math>\,\sin(x)</math>.<br />
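<br />
The augmentation described above can be sketched in a few lines of Matlab. This is a minimal illustration on hypothetical random data (the names <code>X</code>, <code>X_star</code> and <code>w_star</code> are placeholders, and <code>w_star</code> is a random stand-in rather than a fitted LDA weight vector); it only checks that a linear function of the augmented features equals a quadratic function of the original ones.<br />

```matlab
% Hypothetical data: n points in d dimensions, one row per observation.
n = 5; d = 2;
X = randn(n, d);

% Augment each point with the squares of its coordinates: [x, x.^2].
X_star = [X, X.^2];              % n-by-2d

% Stand-in weight vector [w_1..w_d, v_1..v_d] (not estimated from data).
w_star = randn(2*d, 1);
y_star = X_star * w_star;        % linear in the augmented space

% The same values written as a quadratic in the original space:
% x'*v*x + w'*x with v = diag(v_1..v_d).
w = w_star(1:d);
v = diag(w_star(d+1:end));
y_check = sum((X * v) .* X, 2) + X * w;

disp(max(abs(y_star - y_check)));   % should be at rounding-error level
```

Writing the augmentation as <code>[X, X.^2]</code> avoids the explicit double loop and extends directly to other element-wise features.<br />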
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we enlarge <code>X_star</code> to six columns and set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
=== LDA and QDA in Matlab ===<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
 >> plot (sample(1:200,1), sample(1:200,2), 'b.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the label that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
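<br />
:The error rate quoted above follows directly from the two counts. A minimal check, using only the numbers reported in this example (the variable names are chosen here for illustration):<br />

```matlab
% Counts reported above for LDA on the 2_3 data.
n_correct = 369;
n_total   = 400;

% Empirical error rate = fraction of misclassified training points.
err = 1 - n_correct / n_total;
disp(err);
```

:In a live session, where <code>class</code> and <code>group</code> are still in the workspace, the same quantity is <code>1 - sum(class==group)/length(group)</code>.<br />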
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
 >> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it correctly classifies only 2 more data points than LDA (371 versus 369); we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently; the details can be found in the Matlab help file on <code>princomp</code>. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
 % returns the principal components in PC, the so-called Z-scores in SCORES,<br />
 % the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows (m) and columns (n) of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U, up to the sign of each column.<br />
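<br />
Since <code>princomp</code> may not be available in every Matlab release, the same comparison can be sketched with base functions only. The example below is an illustration on hypothetical random data (not the 2_3 data): it checks that the right singular vectors of the centered data agree, up to sign, with the eigenvectors of the sample covariance matrix.<br />

```matlab
% Hypothetical data laid out like the 2_3 data: variables in rows,
% observations in columns.
D = 6; n = 50;
X = randn(D, n);

% Center each variable (row) and take the SVD of the n-by-D matrix.
Xc = X - repmat(mean(X, 2), 1, n);
[s, d, v] = svd(Xc', 0);         % economy-size; v is D-by-D
y = Xc' * v;                     % scores, one row per observation

% Eigen-decomposition of the sample covariance, sorted by eigenvalue.
[E, L] = eig(cov(X'));
[lambda, idx] = sort(diag(L), 'descend');
E = E(:, idx);

% Columns agree up to sign, so compare absolute values.
disp(max(max(abs(abs(v) - abs(E)))));          % near zero
% Squared singular values over (n-1) match the eigenvalues.
disp(max(abs(diag(d).^2 / (n - 1) - lambda))); % near zero
```

The division by <code>n - 1</code> mirrors the <code>sqrt(m-1)</code> scaling in the <code>princomp</code> listing above.<br />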
<br />
'''Useful resources:'''<br />
LDA and QDA in Matlab[http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/classdemo.html],[http://www.mathworks.com/matlabcentral/fileexchange/189],[http://seed.ucsd.edu/~cse190/media07/MatlabClassificationDemo.pdf]<br />
<br />
== '''Reference''' ==<br />
1. Harry Zhang. ''The optimality of naive bayes''. FLAIRS Conference. AAAI Press, 2004<br />
<br />
2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.<br />
<br />
3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman [http://www-stat.stanford.edu/~tibs/ElemStatLearn/ (3rd Edition is available)]<br />
<br />
===Related links to LDA & QDA===<br />
<br />
LDA:[http://www.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf]<br />
<br />
[http://www.dtreg.com/lda.htm]<br />
<br />
[http://biostatistics.oxfordjournals.org/cgi/reprint/kxj035v1.pdf Regularized linear discriminant analysis and its application in microarrays]<br />
<br />
[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf MATHEMATICAL OPERATIONS OF LDA]<br />
<br />
[http://psychology.wikia.com/wiki/Linear_discriminant_analysis Application in face recognition and in market]<br />
<br />
QDA:[http://portal.acm.org/citation.cfm?id=1314542]<br />
<br />
[http://jmlr.csail.mit.edu/papers/volume8/srivastava07a/srivastava07a.pdf Bayes QDA]<br />
<br />
[http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf LDA & QDA]<br />
<br />
==Principal Component Analysis ==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization and data mining, and as a preprocessing step for classification. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data(Variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>). Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
<br />
To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued ===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one gives the larger value: <math>\displaystyle f(\sqrt{2}/2,-\sqrt{2}/2)=\sqrt{2}</math> and <math>\displaystyle f(-\sqrt{2}/2,\sqrt{2}/2)=-\sqrt{2}</math>. The maximum is therefore attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
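<br />
The result of this example can be checked numerically without the Lagrangian, by parametrizing the constraint circle as <math>\displaystyle (x,y)=(\cos\theta,\sin\theta)</math> and searching over <math>\displaystyle \theta</math>. This brute-force grid search is only a sanity check, not the method used in the lecture; the grid size is an arbitrary choice.<br />

```matlab
% Points on the unit circle x^2 + y^2 = 1.
theta = linspace(0, 2*pi, 100001);
x = cos(theta);
y = sin(theta);

% Objective f(x,y) = x - y evaluated along the constraint.
f = x - y;
[fmax, i] = max(f);

% Maximizer and maximum value; compare with (sqrt(2)/2, -sqrt(2)/2) and sqrt(2).
disp([x(i), y(i), fmax]);
disp([sqrt(2)/2, -sqrt(2)/2, sqrt(2)]);
```

The two displayed rows should agree to rounding error.<br />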
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the constraint is violated and the second term is nonzero. Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint, so every stationary point satisfies <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
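<br />
The two facts derived above (the variance along the first principal component equals the largest eigenvalue, and the eigenvalues sum to the total variance) can be checked numerically. The sketch below uses hypothetical synthetic data; the mixing matrix is an arbitrary choice made only to give the three directions different variances.<br />

```matlab
% Hypothetical data: 200 observations (rows) of 3 correlated variables.
X = randn(200, 3) * [2 0 0; 0.5 1 0; 0 0 0.3];

% Sample covariance and its eigen-decomposition, sorted by eigenvalue.
S = cov(X);
[W, L] = eig(S);
[lambda, idx] = sort(diag(L), 'descend');
W = W(:, idx);

% Projection onto the first principal direction.
w1 = W(:, 1);
u1 = X * w1;

% var(u1) should equal the largest eigenvalue lambda(1).
disp([var(u1), lambda(1)]);

% Sum of eigenvalues = trace(S) = sum of the variances of the variables.
disp([sum(lambda), trace(S), sum(var(X))]);
```

Within each displayed row the entries should agree up to rounding error.<br />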
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noise data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images<br />
<br />
The Matlab code is as follows:<br />
<br />
load noisy<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
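<br />
The centering, encoding and reconstruction steps discussed in this section can be sketched as follows. This is a minimal illustration on hypothetical random data, assuming data points are stored as columns; the sizes <code>D</code>, <code>n</code> and <code>d</code> are arbitrary and the code is not taken from the lecture slide.<br />

```matlab
% Hypothetical sizes: D-dimensional data, n points, keep d components.
D = 5; n = 100; d = 2;
X = randn(D, n);                    % columns are data points

% Center the data so that each variable has mean 0.
mu = mean(X, 2);
Xc = X - repmat(mu, 1, n);

% Principal directions are the left singular vectors of the centered data.
[U, S, V] = svd(Xc, 0);
Ud = U(:, 1:d);                     % top d directions, D-by-d

% Encode: project onto the d-dimensional subspace.
Y = Ud' * Xc;                       % d-by-n

% Reconstruct: map back to D dimensions and add the mean again.
Xhat = Ud * Y + repmat(mu, 1, n);   % D-by-n

disp(size(Y));
disp(norm(X - Xhat, 'fro'));        % reconstruction error
```

With <code>d = D</code> the reconstruction error drops to rounding-error level, since no directions are discarded.<br />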
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean(see details[http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia].)<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y = U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=statf10841Scribe&diff=6033statf10841Scribe2010-09-23T16:56:29Z<p>Hclam: </p>
<hr />
<div>{| class="wikitable"<br />
<br />
{| border="1" cellpadding="2"<br />
|-<br />
|width="100pt"|Date<br />
|width="200pt"|Name<br />
|-<br />
|Sep 21 || <br />
|-<br />
|Sep 23||<br />
|-<br />
|Sep 28 || Keith, Ho Chi Lam <br />
|-<br />
|Sep 30||<br />
|-</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3672stat341 / CM 3612009-07-30T01:19:46Z<p>Hclam: /* Acceptance/Rejection Method */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. A good way to generate random numbers in computational statistics involves analyzing various distributions using computational methods. <s>As a result, the probability distribution of each possible number appears to be uniform</s> (pseudo-random). Outside a computational setting, presenting a uniform distribution is fairly easy (for example, rolling a fair die repetitively to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The general form of the above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term; when ''b'' = 0 it reduces to the Multiplicative Congruential Method. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to equal ''m'' - 1. In practice, a paper published in 1988 by Park and Miller found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.<br />
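<br />
As a minimal sketch, the recurrence can be written in a few lines of Python. The function name <code>lcg</code> and its arguments are illustrative and not part of the lecture. It reproduces the two worked examples above: the first parameter set visits all 30 nonzero residues, while the second never leaves 1.<br />

```python
def lcg(a, b, m, seed, n):
    """Return the first n terms x_1, ..., x_n of x_{i+1} = (a*x_i + b) mod m."""
    xs = []
    x = seed
    for _ in range(n):
        x = (a * x + b) % m
        xs.append(x)
    return xs

# First example: a = 13, b = 0, m = 31, x0 = 1
print(lcg(13, 0, 31, 1, 3))               # [13, 14, 27]

# Second example: a = 3, b = 2, m = 4, x0 = 1 (stuck at 1)
print(lcg(3, 2, 4, 1, 3))                 # [1, 1, 1]

# Full period for the first parameter set: 30 distinct values before repeating
print(len(set(lcg(13, 0, 31, 1, 30))))    # 30
```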
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative distribution function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then the random variable X has cdf F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is monotonically non-decreasing, since the probability that X is at most a greater number cannot be smaller than the probability that X is at most a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta t} dt = 1 - e^{-\theta x} </math><br />
<br />
:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as <math>Unif[0, 1]</math> when <math>U \sim~ Unif[0, 1]</math>, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>.<br />
<br />
Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
<br />
<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math> and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math>. Further, for some distributions it is not even possible to find <math>F(x)</math> (i.e. a closed form expression for the distribution function, or otherwise; even if the closed form expression exists, it's usually difficult to find <math>F^{-1}</math>).<br />
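<br />
For distributions where the inverse is available, the procedure is short. The following Python sketch applies it to the exponential distribution; the function name <code>sample_exponential</code>, the seed and the sample size are illustrative assumptions.<br />

```python
import math
import random

def sample_exponential(theta, n, rng=random):
    """Draw n samples from the exponential distribution with rate theta
    using the inverse transform X = -log(1 - U) / theta."""
    # rng.random() lies in [0, 1), so 1 - U lies in (0, 1] and log(0) cannot occur
    return [-math.log(1.0 - rng.random()) / theta for _ in range(n)]

random.seed(0)  # arbitrary seed, only for reproducibility
xs = sample_exponential(theta=1.0, n=100_000)
print(sum(xs) / len(xs))  # should be close to the mean 1/theta = 1
```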
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (each value is assigned a half-open interval of <math>[0, 1]</math>; since <math>U</math> falls exactly on a boundary with probability zero, it does not matter which inequalities are strict, and a draw of <math>U = 1</math> can be assigned to <math>x_n</math>):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U < p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 \leq U < p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} \leq U < p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000);  % preallocate the sample vector<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2))  % cumulative probability 0.3 + 0.2 = 0.5<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />
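<br />
The same idea generalizes to any finite distribution by searching the cumulative probabilities. This is a hedged Python sketch; the helper name <code>sample_discrete</code> is an assumption and the binary search is a convenience, not something prescribed in class.<br />

```python
import random
from bisect import bisect_right
from itertools import accumulate

def sample_discrete(values, probs, n, rng=random):
    """Inverse transform sampling for a discrete distribution."""
    cdf = list(accumulate(probs))  # running sums p_0, p_0 + p_1, ...
    out = []
    for _ in range(n):
        u = rng.random()
        # index of the first cumulative probability strictly greater than u;
        # min() guards against floating-point rounding in the last entry
        k = min(bisect_right(cdf, u), len(values) - 1)
        out.append(values[k])
    return out

random.seed(0)  # arbitrary seed
xs = sample_discrete([0, 1, 2], [0.3, 0.2, 0.5], 10_000)
print([xs.count(v) / len(xs) for v in (0, 1, 2)])  # roughly 0.3, 0.2, 0.5
```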
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
[[Image:edit.JPG|thumb|left|500px|Ali: Some statements are incorrect, inaccurate or misleading. Acceptance-Rejection Method needs to be motivated in more details. ]]<br /><br /><br /><br /><br /><br /><br />
<br />
<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult to sample from directly.<br />
<br />
Let <math>g(x)</math> be a distribution that is easy to sample from and satisfies the condition: <br /><br /><br />
<br />
<math>\forall x: f(x) \leq c \cdot g(x)\ </math>, where <math> c \in \Re^+</math><br />
<br /><br /><br />
[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Since <math>c \cdot g(x) \geq f(x)</math> for all x, it is possible to obtain samples that follow f(x) by rejecting a proportion of samples drawn from g(x).<br /><br />
<br />
This proportion depends on how different f(x) and g(x) are and may vary at different values of x.<br />
<br />
That is, if <math> f(x) \approx g(x) \text { at } x = x_1 \text { and } f(x) \ll g(x) \text { at } x = x_2 </math>, we will need to reject more samples drawn at <math> \,x_2 </math> than at <math> \,x_1 </math>.<br />
<br />
Overall, it can be shown that by accepting samples drawn from g(x) with probability <math> \frac {f(x)}{c \cdot g(x)} </math>, we can obtain samples that follow f(x).<br />
<br /><br /><br />
<br />
Consider the example in the graph,<br /><br />
Sampling y = 7 from <math>g(x)</math> yields a point where <math>f(x)</math> coincides with <math>cg(x)</math>, so it will be accepted w/p 1.<br />
<br />
Sampling y = 9 from <math>g(x)</math> yields a point where <math>f(x)</math> lies far below <math>cg(x)</math>, so it will be accepted with a low probability.<br />
<br />
'''Proof'''<br />
[[Image:edit.JPG|thumb|left|500px|Ali: Proof of what? . ]]<br /><br /><br /><br /><br /><br /><br />
<br />
Show that if points are sampled according to the Acceptance/Rejection method then they follow the target distribution.<br /><br /><br />
<br />
<math> P(X=x|accept) = \frac{P(accept|X=x)P(X=x)}{P(accept)}</math><br /> <br />
by Bayes' theorem<br />
<br /><br /><br />
<math>\begin{align} &P(accept|X=x) = \frac{f(x)}{c \cdot g(x)}\\ &P(X=x) = g(x) \end{align}</math><br /><br />by hypothesis.<br /><br />
<br />
Then,<br /><br />
<math>\begin{align} P(accept) &= \int^{}_x P(accept|X=x)P(X=x) dx \\<br />
&= \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx\\<br />
&= \frac{1}{c} \int^{}_x f(x) dx\\<br />
&= \frac{1}{c} \end{align} </math><br />
<br /><br /><br />
Therefore,<br /><br />
<math> P(X=x|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)}g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This is the smallest constant satisfying <math>c \cdot g(x) \geq f(x)</math> for all x, and since <math>P(accept) = \frac{1}{c}</math>, the smallest valid <math>c</math> gives the highest acceptance rate.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
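 i = 0;  % number of accepted samples so far<br />
 j = 1;  % index of the next accepted sample<br />
 while i < 1000<br />
   y = rand;<br />
   u = rand;<br />
   if u <= y  % accept with probability f(y)/(c*g(y)) = y<br />
     x(j) = y;<br />
     j = j + 1;<br />
     i = i + 1;<br />
   end<br />
 end<br />
 hist(x);<br />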
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
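<br />
A Python sketch of the same Beta(2,1) example is given here as an illustration. The function name <code>sample_beta21</code> and the proposal counter are assumptions added to make the acceptance rate visible; by the proof above it should be near <math>1/c = 0.5</math>.<br />

```python
import random

def sample_beta21(n, rng=random):
    """Acceptance-rejection sampling from f(x) = 2x on [0, 1]
    with proposal g = Unif[0, 1] and c = 2."""
    samples = []
    proposals = 0
    while len(samples) < n:
        y = rng.random()   # proposal Y ~ g
        u = rng.random()   # U ~ Unif[0, 1]
        proposals += 1
        if u <= y:         # f(y) / (c * g(y)) = 2y / 2 = y
            samples.append(y)
    return samples, proposals

random.seed(0)  # arbitrary seed
xs, tried = sample_beta21(10_000)
print(sum(xs) / len(xs))   # Beta(2,1) has mean 2/3
print(len(xs) / tried)     # acceptance rate, about 1/c = 0.5
```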
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
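<br />
The steps above can be sketched in Python as follows. The function name <code>sample_pmf_ar</code> and the dictionary representation of the pmf are illustrative assumptions; the proposal is the discrete uniform distribution on the support, as in the example.<br />

```python
import random

def sample_pmf_ar(pmf, n, rng=random):
    """Acceptance-rejection with a discrete uniform proposal over the keys of pmf."""
    support = list(pmf)
    g = 1.0 / len(support)          # proposal mass at every point (0.25 here)
    c = max(pmf.values()) / g       # 0.55 / 0.25 = 2.2 for the example
    out = []
    while len(out) < n:
        y = rng.choice(support)     # Y ~ Unif{1, 2, 3, 4}
        u = rng.random()            # U ~ Unif[0, 1]
        if u <= pmf[y] / (c * g):   # accept with probability f(Y) / (c * g(Y))
            out.append(y)
    return out

random.seed(0)  # arbitrary seed
pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}
xs = sample_pmf_ar(pmf, 10_000)
print({k: xs.count(k) / len(xs) for k in pmf})  # roughly the target pmf
```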
<br />
'''Limitations'''<br />
<br />
If the proposed distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find such <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
It appears that if we want to sample from the Gamma distribution, we can consider sampling from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
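<br />
For comparison, here is a Python sketch of the same procedure for integer <math>t</math>. The function name <code>sample_gamma</code> is an assumption; <math>\lambda</math> is treated as a rate, matching <math>x_i = -\frac{1}{\lambda}\log(u_i)</math> above.<br />

```python
import math
import random

def sample_gamma(t, lam, n, rng=random):
    """Gamma(t, rate=lam) for integer t, computed as -log(product of t uniforms) / lam."""
    out = []
    for _ in range(n):
        prod = 1.0
        for _ in range(t):
            prod *= 1.0 - rng.random()   # each factor lies in (0, 1], avoiding log(0)
        out.append(-math.log(prod) / lam)
    return out

random.seed(0)  # arbitrary seed
xs = sample_gamma(t=5, lam=1.0, n=10_000)
print(sum(xs) / len(xs))  # should be close to the mean t/lam = 5
```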
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far, e.g. the inverse of F(Z) has no closed form; we may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point on the Cartesian plane, r is the distance from that point to the pole (origin), and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
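<br />
A sketch of why this holds, using the standard change-of-variables formula (this derivation is an added illustration, not part of the lecture notes). With <math>x = \sqrt{d}\cos(\theta)</math> and <math>y = \sqrt{d}\sin(\theta)</math>, the Jacobian determinant is<br />
:<math>\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| = \left|\frac{\cos(\theta)}{2\sqrt{d}}\cdot\sqrt{d}\cos(\theta) + \sqrt{d}\sin(\theta)\cdot\frac{\sin(\theta)}{2\sqrt{d}}\right| = \frac{1}{2}</math><br />
so that, substituting <math>x^2 + y^2 = d</math> into the joint density of x and y,<br />
:<math>\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}}\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| = \frac{1}{2}e^{-\frac{d}{2}}\cdot\frac{1}{2\pi}, \quad d > 0,\ 0 \leq \theta < 2\pi</math><br />
The first factor is the density of an exponential distribution with rate 1/2 and the second is the density of a uniform distribution on <math>[0, 2\pi]</math>.<br />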
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
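<br />
The Box-Muller procedure can also be sketched in Python using only the standard library. The function name <code>box_muller</code> is illustrative; the code follows the three steps of the procedure above.<br />

```python
import math
import random

def box_muller(rng=random):
    """Return one pair (x, y) of independent N(0, 1) draws."""
    u1 = 1.0 - rng.random()        # in (0, 1], avoids log(0)
    u2 = rng.random()
    d = -2.0 * math.log(u1)        # d = r^2, exponential with rate 1/2
    theta = 2.0 * math.pi * u2     # uniform on [0, 2*pi)
    r = math.sqrt(d)
    return r * math.cos(theta), r * math.sin(theta)

random.seed(0)  # arbitrary seed
pairs = [box_muller() for _ in range(10_000)]
mean_x = sum(p[0] for p in pairs) / len(pairs)
var_x = sum((p[0] - mean_x) ** 2 for p in pairs) / len(pairs)
print(mean_x, var_x)  # should be close to 0 and 1
```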
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independent} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
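<br />
As a hedged illustration of the third property, the following Python sketch handles the bivariate case with a hand-written Cholesky factor. The names <code>chol2</code> and <code>sample_mvn2</code> are assumptions. Note that it uses the lower-triangular factor <math>L</math> with <math>LL^T = \Sigma</math>, which is the transpose of the upper-triangular matrix returned by Matlab's chol.<br />

```python
import math
import random

def chol2(s11, s12, s22):
    """Entries (l11, l21, l22) of the lower-triangular L with L L^T = [[s11, s12], [s12, s22]]."""
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    return l11, l21, l22

def sample_mvn2(mu, sigma, n, rng=random):
    """Draw n points from a bivariate N(mu, sigma) as mu + L z with z ~ N(0, I)."""
    l11, l21, l22 = chol2(sigma[0][0], sigma[0][1], sigma[1][1])
    pts = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        pts.append((mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2))
    return pts

random.seed(0)  # arbitrary seed
pts = sample_mvn2([-2.0, 3.0], [[1.0, 0.5], [0.5, 1.0]], 15_000)
print(sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))  # near -2 and 3
```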
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
In this lecture we will continue to discuss sampling from specific distributions, introduce '''Monte Carlo Integration''', and also talk about the differences between the Bayesian and Frequentist views on probability, along with references to '''Bayesian Inference'''.<br />
<br />
====Binomial Distribution====<br />
A Binomial distribution <math>X \sim~ Bin(n,p) </math> is the sum of <math>n</math> independent Bernoulli trials, each with probability of success <math>p</math> <math>(0 \leq p \leq 1)</math>. For each trial we generate an independent uniform random variable: <math>U_1, \ldots, U_n \sim~ Unif(0,1)</math>. Then X is the number of times that <math>U_i \leq p</math>. In this case if n is large enough, by the central limit theorem, the Normal distribution can be used to approximate a Binomial distribution.<br />
<br />
Sampling from Binomial distribution in Matlab is done using the following code:<br />
n=3;<br />
p=0.5;<br />
trials=1000;<br />
X=sum((rand(trials,n))'<=p);<br />
hist(X)<br />
<br />
The resulting histogram shows a Binomial distribution; for higher <math>n</math>, it would resemble a Normal distribution.<br />
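<br />
The counting construction described above can be sketched in Python as follows. The function name <code>sample_binomial</code> is an illustrative assumption.<br />

```python
import random

def sample_binomial(n, p, rng=random):
    """One Bin(n, p) draw: count how many of n independent uniforms are <= p."""
    return sum(1 for _ in range(n) if rng.random() <= p)

random.seed(0)  # arbitrary seed
draws = [sample_binomial(3, 0.5) for _ in range(1000)]
print(sum(draws) / len(draws))  # should be close to the mean n*p = 1.5
```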
<br />
====Monte Carlo Integration====<br />
<br />
Monte Carlo Integration is a numerical method of approximating the evaluation of integrals using random numbers generated from simulations. In this course we will mainly look at three methods for approximating integrals:<br />
# Basic Monte Carlo Integration<br />
# Importance Sampling<br />
# Markov Chain Monte Carlo (MCMC)<br />
<br />
====Bayesian VS Frequentists====<br />
<br />
Over the history of statistics, two major schools of thought have emerged and have been locked in an ongoing struggle to determine which one has the correct view on probability. These two schools are known as the Bayesian and Frequentist schools of thought. The Bayesians and the Frequentists hold different philosophical views on what defines probability. Below are some fundamental differences between the Bayesian and Frequentist schools of thought:<br />
<br />
'''Frequentist'''<br />
*Probability is '''objective''' and refers to the limit of an event's relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads 50% of the time.<br />
*Parameters are all fixed and unknown constants.<br />
*Any statistical process only has interpretations based on limiting frequencies. For example, a 95% C.I. of a given parameter will contain the true value of the parameter 95% of the time.<br />
<br />
'''Bayesian'''<br />
*Probability is '''subjective''' and can be applied to single events based on degree of confidence or beliefs. For example, a Bayesian can refer to tomorrow's weather as having a 50% chance of rain, whereas this would not make sense to a Frequentist because tomorrow is just one unique event, and cannot be referred to as a relative frequency in a large number of trials.<br />
*Parameters are random variables that have a given distribution, and other probability statements can be made about them.<br />
*Probability has a distribution over the parameters, and point estimates are usually done by either taking the mode or the mean of the distribution. <br />
<br />
====Bayesian Inference====<br />
<br />
'''Example''':<br />
Suppose we have a screen that only displays single digits from 0 to 9, and this screen is split into a 4x5 matrix of pixels. All together, the 20 pixels that make up the screen can be referred to as <math>\vec{X}</math>, which is our data. The parameter of the data in this case, which we will refer to as <math> \theta </math>, is a discrete random variable that can take on the values 0 to 9. In this example, a Bayesian would be interested in finding <math> Pr(\theta=a|\vec{X}=\vec{x})</math>, whereas a Frequentist would be more interested in finding <math> Pr(\vec{X}=\vec{x}|\theta=a)</math>.<br />
<br />
=====Bayes' Rule=====<br />
<br />
:<math>f(\theta|X) = \frac{f(X | \theta)\, f(\theta)}{f(X)}.</math><br />
<br />
Note: In this case <math>f (\theta|X)</math> is referred to as '''posterior''', <math>f (X | \theta)</math> as '''likelihood''', <math>f (\theta)</math> as '''prior''', and <math>f (X)</math> as the '''marginal''', where <math>\theta</math> is the parameter and <math>X</math> is the observed variable.<br />
<br />
'''Procedure in Bayesian Inference'''<br />
*First choose a probability distribution as the prior, which represents our beliefs about the parameters.<br />
*Then choose a probability distribution for the likelihood, which represents our beliefs about the data.<br />
*Lastly compute the posterior, which represents an update of our beliefs about the parameters after having observed the data.<br />
<br />
As mentioned before, for a Bayesian, finding point estimates usually involves finding the mode or the mean of the parameter's distribution.<br />
<br />
'''Methods'''<br />
*Mode: <math>\hat\theta = \arg\max_{\theta} f(\theta|X) \gets</math> value of <math>\theta</math> that maximizes <math>f(\theta|X)</math><br />
*Mean: <math> \bar\theta = \int^{}_\theta \theta \cdot f(\theta|X)d\theta</math><br />
<br />
If it is the case that <math>\theta</math> is high-dimensional, and we are only interested in one of the components of <math>\theta</math>, for example, we want <math>\theta_1</math> from <math> \vec{\theta}=(\theta_1,\dots,\theta_n)</math>, then we would have to calculate the integral: <math>\int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_n </math><br />
<br />
This sort of calculation is usually very difficult or not feasible to compute, and thus we would need to do it by simulation.<br />
<br />
'''Note''': <br />
#<math>f(X)=\int^{}_\theta f(X | \theta)f(\theta) d\theta</math> is not a function of <math>\theta</math>, and is called the '''Normalization Factor'''<br />
#Therefore, since <math>f(X)</math> is constant with respect to <math>\theta</math>, the posterior is proportional to the likelihood times the prior: <math>f(\theta|X)\propto f(X | \theta)f(\theta)</math><br />
<br />
===[[Monte Carlo Integration]] - May 26, 2009===<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[-\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right]</math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta = \mu</math> is Gaussian (with <math>\sigma</math> known):<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \left[\prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}\right] * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
:<math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>, so in general it differs from the MLE.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
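<br />
The shrinkage of the posterior mean towards the prior mean can be checked numerically. The following is a minimal Python sketch (the lecture itself uses Matlab); the function name <code>posterior_mean</code> and all numerical values are assumptions made for illustration only.<br />

```python
import random
import statistics


def posterior_mean(xs, sigma, mu0, tau):
    """Posterior mean of mu for N(mu, sigma^2) data with a N(mu0, tau^2) prior."""
    n = len(xs)
    if n == 0:
        return mu0  # N = 0: the posterior mean equals the prior mean
    precision_data = n / sigma ** 2    # N / sigma^2
    precision_prior = 1.0 / tau ** 2   # 1 / tau^2
    weight = precision_data / (precision_data + precision_prior)
    return weight * statistics.fmean(xs) + (1.0 - weight) * mu0


rng = random.Random(0)
true_mu, sigma = 3.0, 2.0   # hypothetical data-generating values
mu0, tau = 0.0, 1.0         # hypothetical prior
for n in (0, 5, 50, 5000):
    xs = [rng.gauss(true_mu, sigma) for _ in range(n)]
    mle = statistics.fmean(xs) if xs else float("nan")
    print(n, mle, posterior_mean(xs, sigma, mu0, tau))
```

As N grows, the printed posterior mean should approach the MLE <math>\bar{x}</math>, in line with the limit noted above.<br />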
<br />
====Basic Monte Carlo Integration====<br />
<br />
The term "Monte Carlo method" has no single precise definition, but the method is widely used and is described in many articles and monographs. The term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. A ''stochastic process'' is a random process whose future evolution is described by probability distributions (the counterpart of a deterministic process), and simulation methods based on such processes are known as ''Monte Carlo methods'' [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)]</math>, which is estimated by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
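<br />
The process and the two statistics above can be sketched in a few lines of Python (standard library only, rather than the Matlab used in lecture). The function name <code>mc_integrate</code>, the seed, and the sample size are assumptions of this sketch; the integrand <math>x^3</math> on <math>[0,1]</math> is used only as a test case.<br />

```python
import math
import random


def mc_integrate(h, a, b, n, rng):
    """Estimate the integral of h over [a, b] with n uniform draws.

    Returns (estimate, standard_error, (ci_low, ci_high)), where the
    interval is an approximate 95% confidence interval.
    """
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]   # Y_i = w(x_i)
    i_hat = sum(ys) / n
    s2 = sum((y - i_hat) ** 2 for y in ys) / (n - 1)           # sample variance s^2
    se = math.sqrt(s2 / n)                                     # SE = s / sqrt(N)
    z = 1.959964                                               # Z_{alpha/2} for alpha = 0.05
    return i_hat, se, (i_hat - z * se, i_hat + z * se)


rng = random.Random(1)
est, se, ci = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 100_000, rng)
print(est, se, ci)
```

The estimate should be close to the exact value 1/4, and the standard error shrinks at the rate <math>1/\sqrt{N}</math>.<br />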
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math> (here by the inverse transform <math>x = -\log(u)</math> with <math>u \sim Unif(0,1)</math>)<br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u=rand(100,1);   % uniform draws<br />
x=-log(u);       % inverse transform: x ~ Exp(1)<br />
h=x.^.5;         % h(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value obtained by numerical integration in Matlab (the exact value is sqrt(pi)/2, about 0.8862)<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math> (the exact value, for comparison)<br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);   % 1000 uniform draws (rand(1000) would create a 1000x1000 matrix)<br />
mean(x.^3)          % element-wise cube<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite interval<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math> and the limits <math>x=5,\ x\to\infty</math> become <math>y=1,\ y\to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
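<br />
The change of variables in Example 3 can be checked against the closed form <math>\int_5^\infty 3e^{-x}\,dx = 3e^{-5}</math>. The Python sketch below is only an illustration; the seed and the sample size are arbitrary assumptions, and the integrand is written with a positive <math>y^2</math> in the denominator, which is what the substitution gives once the limits are put in increasing order.<br />

```python
import math
import random


def w(y):
    """Transformed integrand 3*exp(-(1/y + 4)) / y^2 on (0, 1]."""
    return 3.0 * math.exp(-(1.0 / y + 4.0)) / (y * y)


rng = random.Random(2)
n = 200_000
# 1 - rng.random() lies in (0, 1], which avoids dividing by zero at y = 0
i_hat = sum(w(1.0 - rng.random()) for _ in range(n)) / n
exact = 3.0 * math.exp(-5.0)   # closed form of the original integral
print(i_hat, exact)
```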
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be flat and <math>\displaystyle f(x,y)</math> is the normalizing constant. The posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find the conditional distributions of <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior on <math>p</math>, the posterior is proportional to the Binomial likelihood, <math>\displaystyle f(p|x) \propto p^x(1-p)^{n-x}</math>, which is the kernel of a Beta distribution. Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{i=1}^{N}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for Y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta)<br />
<br />
In one run of this code the mean was approximately 0.1938; the value varies slightly from run to run.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\ \text{CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
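<br />
For readers without Matlab, the same simulation can be sketched in Python with <code>random.betavariate</code>. The counts are those of the Matlab example; the seed, the number of draws <code>N</code>, and the simple index-based quantiles are assumptions of this sketch.<br />

```python
import random

rng = random.Random(3)
n, m = 100, 100      # number of trials for X and Y, as in the Matlab example
x, y = 80, 60        # number of successes, as in the Matlab example
N = 20_000           # number of posterior draws (an assumption of this sketch)

# Each delta is a draw of p1 - p2 with p1 ~ Beta(x+1, n-x+1), p2 ~ Beta(y+1, m-y+1)
delta = sorted(
    rng.betavariate(x + 1, n - x + 1) - rng.betavariate(y + 1, m - y + 1)
    for _ in range(N)
)
delta_hat = sum(delta) / N
ci = (delta[int(0.025 * N)], delta[int(0.975 * N)])   # empirical 2.5% and 97.5% quantiles
analytic = (x + 1) / (n + 2) - (y + 1) / (m + 2)      # E(p1) - E(p2)
print(delta_hat, ci, analytic)
```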
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We proceed as follows:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>i = 1, ..., 10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta</math> as the first dose <math>x_j</math> whose sampled <math>\displaystyle p_j</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
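<br />
The rejection step in the process above can be sketched as follows. This is a Python illustration only: the data are hypothetical (5 doses and 20 rats per dose, not the lecture's ten doses, whose counts are not fully listed), and the function name <code>sample_delta</code> and its <code>max_tries</code> guard are assumptions of the sketch.<br />

```python
import random


def sample_delta(doses, deaths, n_rats, rng, max_tries=100_000):
    """One posterior draw of delta = x_j with j = min{i : p_i >= 0.5}.

    Draws p_i ~ Beta(y_i + 1, n - y_i + 1) independently and keeps the draw
    only if p_1 <= p_2 <= ... (rejection step). Returns None when no sampled
    p_i reaches 0.5 or no ordered draw is found within max_tries.
    """
    for _ in range(max_tries):
        p = [rng.betavariate(y + 1, n_rats - y + 1) for y in deaths]
        if all(p[i] <= p[i + 1] for i in range(len(p) - 1)):
            for dose, p_i in zip(doses, p):
                if p_i >= 0.5:
                    return dose
            return None
    return None


# Hypothetical data, NOT the lecture's: 5 doses, 20 rats per dose.
doses = [1.0, 2.0, 3.0, 4.0, 5.0]
deaths = [1, 5, 10, 15, 19]
rng = random.Random(4)
draws = [sample_delta(doses, deaths, 20, rng) for _ in range(2_000)]
kept = [d for d in draws if d is not None]
delta_bar = sum(kept) / len(kept)
print(delta_bar)
```

With ten doses the ordering constraint rejects a much larger share of the draws, so plain rejection can become slow; that cost is the price of the simplicity of this approach.<br />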
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling is a technique for estimating properties of a particular distribution using samples drawn from a different distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the random variables. The key idea in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the proposal distribution to sample from and g(x) is the target function. Here we cast the integral <math>\int_{0}^1 g(x)dx</math> as the expectation with respect to U, such that <math>\int_{0}^1 g(x)dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Draw <math>\displaystyle x_1, x_2, ..., x_N \sim~ g</math> and compute <math>\displaystyle w(x_i)=\frac{h(x_i)f(x_i)}{g(x_i)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
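<br />
The process can be written as a small generic routine. The Python sketch below is an illustration under stated assumptions: the function name <code>importance_sampling</code> and its arguments are made up for this example, and the target <math>f = Exp(1)</math>, proposal <math>g = Exp(0.5)</math> and <math>h(x) = x</math> are chosen only because the true value <math>I = E_f[X] = 1</math> is known.<br />

```python
import math
import random


def importance_sampling(h, f, g_pdf, g_draw, n, rng):
    """Estimate I = integral of h(x) f(x) dx using n draws from g."""
    total = 0.0
    for _ in range(n):
        x = g_draw(rng)                    # x_i ~ g
        total += h(x) * f(x) / g_pdf(x)    # w(x_i) = h(x_i) f(x_i) / g(x_i)
    return total / n


# Illustration (all choices are assumptions of this sketch):
# target f = Exp(1), proposal g = Exp(rate 0.5), h(x) = x, so I = 1.
rate = 0.5
i_hat = importance_sampling(
    h=lambda x: x,
    f=lambda x: math.exp(-x),
    g_pdf=lambda x: rate * math.exp(-rate * x),
    g_draw=lambda r: r.expovariate(rate),
    n=100_000,
    rng=random.Random(5),
)
print(i_hat)
```

The proposal here has a heavier tail than the target, which keeps the weights bounded.<br />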
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> at which the original density <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), is large relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we apply a regular Monte Carlo simulation with samples drawn from <math>\displaystyle g(x)</math> and weight each value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that the samples that are more probable under <math>f</math> than under <math>g</math> count more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of <math> h(x_i) </math> instead of a plain sum.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen <math>g</math>, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x)</math> does not vanish at least as fast, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, in which case <math>\frac{h^2(x)f^2(x)}{g(x)} </math> could be infinitely large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
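<br />
As a worked illustration of the tail condition (the densities here are chosen only for this example and are not from the lecture), take <math>\displaystyle f(x) = e^{-x}</math> and <math>\displaystyle g(x) = \lambda e^{-\lambda x}</math> on <math>x \geq 0</math>, with <math>\displaystyle h(x) = 1</math>. Then<br />
:<math>\displaystyle E_g[(w(x))^2] = \int_0^\infty \frac{e^{-2x}}{\lambda e^{-\lambda x}}\,dx = \frac{1}{\lambda}\int_0^\infty e^{-(2-\lambda)x}\,dx = \begin{cases} \frac{1}{\lambda(2-\lambda)}, & \text{if } 0 < \lambda < 2 \\ \infty, & \text{if } \lambda \geq 2 \end{cases}</math><br />
So a proposal whose tail decays at least twice as fast as that of <math>f</math> (<math>\lambda \geq 2</math>) gives an infinite second moment, even though the estimator is still unbiased, while <math>\lambda = 1</math> (that is, <math>g = f</math>) gives <math>w(x) = 1</math> and a second moment of 1.<br />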
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> must be non-negative in order to be a density. Analytically, we can show that the best <math>\displaystyle g</math> is the one that minimizes the variance of the estimator.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
======Aside: Jensen's Inequality======<br />
In this aside <math>\displaystyle g </math> denotes a generic convex function, not the proposal density. If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math> for all <math>\displaystyle \alpha \in [0,1]</math><br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to any finite convex combination:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
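Using Jensen's Inequality for a convex function <math>\displaystyle g</math> and a random variable taking the values <math>\displaystyle x_i</math> with probabilities <math>\displaystyle p_i</math>, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... + p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore, applying this with the convex function <math>\displaystyle t \mapsto t^2</math> to <math>\displaystyle \left| w(x) \right|</math>,<br />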
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
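Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E_g[(w(x))^2]</math>, then <math>\displaystyle \left| w(x) \right| = \int \left| h(s) \right| f(s) ds</math> is constant, so <math>\displaystyle E_{g^*}[(w(x))^2] = \left(\int \left| h(s) \right| f(s) ds\right)^2</math> and the lower bound is attained.<br /><br />
<br />
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is itself an integral of the kind we are trying to estimate.<br />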
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
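<br />
The theorem can be illustrated in a case where <math>g^*</math> happens to be available in closed form. In the Python sketch below, which is an assumption-laden example rather than part of the lecture, <math>f</math> is the Exp(1) density and <math>h(x) = x</math>, so <math>\int |h| f = 1</math> and <math>g^*(x) = xe^{-x}</math> is the Gamma(2, 1) density; every weight <math>w(x_i)</math> then equals the true value <math>I = 1</math>.<br />

```python
import math
import random

rng = random.Random(6)


def h(x):
    return x


def f(x):
    return math.exp(-x)        # Exp(1) density on x > 0


def g_star(x):
    return x * math.exp(-x)    # |h(x)| f(x) / integral(|h| f); the integral is 1 here


# g* is the Gamma(shape=2, scale=1) density, which can be sampled directly.
xs = [rng.gammavariate(2.0, 1.0) for _ in range(1_000)]
weights = [h(x) * f(x) / g_star(x) for x in xs]
i_hat = sum(weights) / len(weights)
spread = max(weights) - min(weights)   # zero spread means zero variance
print(i_hat, spread)
```

In this special case the estimator has zero variance, but only because the normalizing constant of <math>g^*</math> was already known.<br />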
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \frac{1}{N}\displaystyle\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since <math>\displaystyle \int f(x)\,dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant, because the unknown constant cancels in <math>\displaystyle b'</math>. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
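<br />
The normalized-weight form can be sketched as below. This Python example is illustrative only: the function name <code>self_normalized_is</code> is made up, and the target (a standard normal density with its constant dropped), the proposal <math>N(0, 2^2)</math> and <math>h(x) = x^2</math> are assumptions chosen so that the true value, 1, is known.<br />

```python
import math
import random


def self_normalized_is(h, f_unnorm, g_pdf, g_draw, n, rng):
    """Estimate E_f[h(X)] when f is known only up to a constant."""
    xs = [g_draw(rng) for _ in range(n)]
    b = [f_unnorm(x) / g_pdf(x) for x in xs]                 # unnormalized weights b(x_i)
    total = sum(b)
    return sum(bi / total * h(x) for bi, x in zip(b, xs))    # sum of b'(x_i) h(x_i)


# Illustration (assumed for this sketch): f is N(0,1) without its
# 1/sqrt(2*pi) constant, g is N(0, 2^2), and h(x) = x^2.
s = 2.0
i_hat = self_normalized_is(
    h=lambda x: x * x,
    f_unnorm=lambda x: math.exp(-0.5 * x * x),
    g_pdf=lambda x: math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi)),
    g_draw=lambda r: r.gauss(0.0, s),
    n=100_000,
    rng=random.Random(7),
)
print(i_hat)
```

Multiplying <code>f_unnorm</code> by any positive constant leaves the estimate unchanged, which is the point of normalizing the weights.<br />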
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to compute the proportion of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is considered a low probability event (a tail event), and different runs may give significantly different values. For example, with 1000 samples the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (100 runs, each generating 100 samples):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is the standard normal density and <math>g(x)</math> needs to be chosen wisely so that it is similar to <math>h(x)f(x)</math>, i.e. it places most of its mass where <math>x \geq 3</math>.<br />
<br />
:Let <math>g(x)</math> be the density of <math>N(4,1) </math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and much less than the variability observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set from which t takes its values. It could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
That is,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t=i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below (row <math>i</math> holds the probabilities of moving from state <math>i</math>):<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0&0&0\\<br />
q&0&p&0&\dots&0&0&0\\<br />
0&q&0&p&\dots&0&0&0\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\<br />
0&0&0&0&\dots&q&0&p\\<br />
0&0&0&0&\dots&0&0&1<br />
\end{matrix}\right)</math><br />
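<br />
A random walk with absorbing end states, moving right with probability p and left with probability q = 1 - p as described in the text, can be built and simulated with a short Python sketch. The function names, the number of states and the value p = 0.6 are assumptions for illustration, and states are numbered from 0 rather than 1.<br />

```python
import random


def random_walk_matrix(n_states, p):
    """Transition matrix of a random walk on states 0..n_states-1.

    Interior states move right with probability p and left with q = 1 - p;
    the two end states are absorbing.
    """
    q = 1.0 - p
    P = [[0.0] * n_states for _ in range(n_states)]
    P[0][0] = 1.0
    P[-1][-1] = 1.0
    for i in range(1, n_states - 1):
        P[i][i - 1] = q   # step to the left
        P[i][i + 1] = p   # step to the right
    return P


def simulate(P, start, rng, max_steps=10_000):
    """Follow the chain from `start` until it reaches an absorbing state."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if P[state][state] == 1.0:   # absorbing state: stop
            break
        state = rng.choices(range(len(P)), weights=P[state])[0]
        path.append(state)
    return path


P = random_walk_matrix(5, p=0.6)
path = simulate(P, start=2, rng=random.Random(8))
print(P)
print(path)
```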
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a sample average, as in basic Monte Carlo integration, can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1 over all <math>j</math>, as the row itself must be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
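<br />
The identity <math>\mu_n=\mu_0P^n</math> can be checked numerically by comparing n one-step updates with a single multiplication by <math>P^n</math>. In this Python sketch the helper names and the two-state chain are hypothetical and serve only as an illustration.<br />

```python
def vec_mat(mu, P):
    """Row vector times matrix: (mu P)_j = sum_i mu_i P_ij."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]


def mat_mul(A, B):
    """Matrix product, computed row by row."""
    return [vec_mat(row, B) for row in A]


def mat_pow(P, n):
    """P^n by repeated multiplication (P^0 is the identity)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


# Hypothetical 2-state chain used only for illustration.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu0 = [1.0, 0.0]

mu = mu0
for _ in range(10):                        # mu_n = mu_{n-1} P, applied 10 times
    mu = vec_mat(mu, P)
mu_direct = vec_mat(mu0, mat_pow(P, 10))   # mu_0 P^10
print(mu, mu_direct)
```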
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A Markov chain is '''Irreducible''' if all states communicate with each other (i.e. it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only move to each other or stay where they are: the only nonzero entries in rows 1 and 2 are <math>\displaystyle P_{11}</math>, <math>\displaystyle P_{12}</math>, <math>\displaystyle P_{21}</math> and <math>\displaystyle P_{22}</math><br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class with only one element<br />
::<br />
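<br />
The decomposition can also be found mechanically: state j belongs to the class of i when each is reachable from the other. The following Python sketch applies this to the matrix of the example; the function names are hypothetical, states are indexed from 0 internally, and reachability in zero steps is allowed so that every state communicates with itself.<br />

```python
# Transition matrix of the example (states 1..4 stored at indices 0..3).
P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]

def reachable(P, i):
    """States j with P_ij(n) > 0 for some n >= 0 (depth-first search)."""
    seen = {i}
    stack = [i]
    while stack:
        s = stack.pop()
        for j, p in enumerate(P[s]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def communicating_classes(P):
    """Group states that reach each other; also return the reachability sets."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        cls = {j for j in reach[i] if i in reach[j]}
        classes.append(sorted(cls))
        assigned |= cls
    return classes, reach

classes, reach = communicating_classes(P)
for cls in classes:
    # A class is closed when nothing outside it can be reached from inside it.
    closed = all(reach[i] <= set(cls) for i in cls)
    print([s + 1 for s in cls], "closed" if closed else "not closed")
```

For this matrix the sketch should report the classes {1, 2} and {4} as closed and {3} as not closed, in agreement with the decomposition above.<br />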
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i</math> for some <math>\displaystyle n\geq 1 | x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. If we take n steps to the right, the only way to get back to <math>\displaystyle 0</math> is to take n steps to the left, and vice versa. A return at time 2n therefore consists of n steps in each direction, in any order:<br />
<math>\displaystyle Pr(x_{2n}=0|x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since a return to 0 is only possible after an even number of steps, the sum becomes:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> the sum converges and the state is transient; otherwise (<math>\displaystyle 4pq = 1</math>, i.e. <math>p=\frac{1}{2}</math>) it diverges and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} = \begin{cases}<br />
\infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
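<br />
The closed form can be checked numerically. The Python sketch below sums <math>\binom{2n}{n}(pq)^n</math> term by term; the choice <math>p=0.6</math> and the numbers of terms are arbitrary illustrative assumptions. For <math>p=0.6</math> the partial sums should settle near <math>1/\sqrt{1-4pq}</math>, while for <math>p=0.5</math> they keep growing as more terms are added.<br />

```python
import math

def partial_sum(p, n_terms):
    """Sum of P_00(2n) = C(2n, n) (pq)^n for n = 0 .. n_terms - 1."""
    q = 1.0 - p
    term = 1.0          # the n = 0 term
    total = 0.0
    for n in range(n_terms):
        total += term
        # C(2n+2, n+1) / C(2n, n) = (2n+1)(2n+2) / (n+1)^2
        term *= (2 * n + 1) * (2 * n + 2) / ((n + 1) ** 2) * p * q
    return total

# Asymmetric walk: the series converges to 1 / sqrt(1 - 4pq).
asym = partial_sum(0.6, 2000)
closed_form = 1.0 / math.sqrt(1.0 - 4 * 0.6 * 0.4)
print(asym, closed_form)

# Symmetric walk: the partial sums grow without bound.
for n_terms in (100, 10_000):
    print(n_terms, partial_sum(0.5, n_terms))
```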
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
min\{n\geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j|x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> with period <math>\displaystyle d>1</math> if, starting from a state, a return to that state is only possible after a multiple of <math>\displaystyle d</math> steps.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition. Starting from state 1, you can only get back to state 1 after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
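<br />
The period of a state can be estimated as the greatest common divisor of the step counts at which a return has positive probability. The Python sketch below does this for the two example chains. The function name and the cut-off <code>max_steps</code> are assumptions made for illustration; only returns within the first <code>max_steps</code> steps are examined.<br />

```python
from math import gcd

def period(P, i, max_steps=50):
    """gcd of all n <= max_steps with P_ii(n) > 0 (0 if no return is seen)."""
    size = len(P)
    # support[j] is True when j can be reached from i in exactly n steps
    support = [j == i for j in range(size)]
    d = 0
    for n in range(1, max_steps + 1):
        support = [any(support[k] and P[k][j] > 0 for k in range(size))
                   for j in range(size)]
        if support[i]:
            d = gcd(d, n)   # gcd(0, n) == n, so the first return sets d
    return d

three_cycle = [[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]]
four_state = [[0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0],
              [0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0]]

print(period(three_cycle, 0), period(four_state, 0))
```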
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition: Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math><br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5\ 0\ 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n)= E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
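<br />
The result can be verified in two ways: exactly, by checking <math>\pi=\pi P</math> with rational arithmetic, and numerically, by iterating <math>\mu_n=\mu_{n-1}P</math> from an arbitrary start. The following Python sketch does both; the helper name, the starting distribution and the number of iterations are assumptions chosen for illustration.<br />

```python
from fractions import Fraction as F

P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(1, 4), F(1, 4)],
     [F(0),    F(1, 3), F(2, 3)]]
pi = [F(4, 11), F(4, 11), F(3, 11)]

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

# Exact check: pi is a pmf and satisfies pi = pi P.
is_stationary = vec_mat(pi, P) == pi and sum(pi) == 1
print(is_stationary)

# Numerical check of the limit: start in the first state and iterate.
P_float = [[float(x) for x in row] for row in P]
mu = [1.0, 0.0, 0.0]
for _ in range(200):
    mu = vec_mat(mu, P_float)
print(mu, [float(x) for x in pi])
```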
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from <math>f(x)</math>, the 'target' distribution. Choose <math>q(y | x)</math>, the 'proposal' distribution that is easily sampled from.<br /><br /><br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>, this is the starting point of the chain, choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
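<br />
The six steps above can be sketched as a small generic routine. This is a Python sketch under stated assumptions, not the lecture's code: the function signature is hypothetical, and the test case (an Exp(1) target with a log-normal proposal, which is not symmetric, so the ratio <math>q(x\mid{y})/q(y\mid{x})</math> does not cancel) was chosen only to exercise the full acceptance ratio.<br />

```python
import math
import random

def metropolis_hastings(f, propose, q, x0, n, rng):
    """Run n Metropolis-Hastings steps and return the whole chain.

    f(x)            target density, known up to a constant
    propose(x, rng) draws Y ~ q(. | x)
    q(y, x)         proposal density q(y | x)
    """
    x = x0                                               # step 1
    chain = [x]
    for _ in range(n):
        y = propose(x, rng)                              # step 2
        r = min(f(y) / f(x) * q(x, y) / q(y, x), 1.0)    # step 3
        if rng.random() < r:                             # steps 4 and 5
            x = y
        chain.append(x)                                  # step 6
    return chain

# Assumed illustration: target Exp(1) on x > 0, log-normal proposal around x.
S = 1.0   # spread of the proposal on the log scale (hypothetical choice)

def f(x):
    return math.exp(-x) if x > 0 else 0.0

def propose(x, rng):
    return x * math.exp(S * rng.gauss(0.0, 1.0))

def q(y, x):
    z = (math.log(y) - math.log(x)) / S
    return math.exp(-0.5 * z * z) / (y * S * math.sqrt(2 * math.pi))

rng = random.Random(0)
chain = metropolis_hastings(f, propose, q, 1.0, 50_000, rng)
kept = chain[5_000:]            # discard an initial stretch of the chain
print(sum(kept) / len(kept))    # the Exp(1) mean is 1
```

Because only the ratio <math>f(y)/f(x)</math> enters, the target needs to be known only up to a multiplicative constant.<br />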
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. At the first step, <math>x</math> is the randomly chosen starting point <math>X_0</math>.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of the Metropolis Hastings Algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the density of sampling y from a normal with mean x is equal to the density of sampling x from a normal with mean y.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then most proposals land far out in the tails of the target, where the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution.<br />
<br />
::It is easy to observe by looking at the histogram of <math>\displaystyle X</math>, the shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
:: Most likely we reject y and the chain will get stuck.<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
:'''If b is just right'''<br />
::Well chosen b will help avoid the issues mentioned above and we can say that the chain is "mixing well".<br />
<br />
<br />
<gallery widths="250px" heights="250px"><br />
Image:Blarge.jpg|B too large (B=1000)<br />
Image:Bsmall.JPG|B too small (B=0.001)<br />
Image:Bgood.JPG|Good choice of B (B=2)<br />
</gallery><br />
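<br />
One quick diagnostic for the choice of b is the fraction of proposals that get accepted. The Python sketch below reruns the Cauchy example for the three values of b shown in the figures; the function name, the chain length and the seed are illustrative assumptions. A rate very close to 1 suggests steps that are too small to explore the target, and a rate very close to 0 suggests a chain that is mostly stuck.<br />

```python
import random

def acceptance_rate(b, n=20_000, seed=0):
    """Random-walk Metropolis for the Cauchy target; fraction of accepted proposals."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)                  # random starting point
    accepted = 0
    for _ in range(n):
        y = x + b * rng.gauss(0.0, 1.0)      # Y ~ N(x, b^2)
        r = min((1 + x * x) / (1 + y * y), 1.0)
        if rng.random() < r:
            x = y
            accepted += 1
    return accepted / n

for b in (0.001, 2, 1000):
    print(b, acceptance_rate(b))
```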
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
That is to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> (by detailed balance)<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where it is exactly equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
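<br />
The same argument can be checked exactly on a small discrete example. In the Python sketch below the three-state target <code>f</code> and the non-symmetric proposal matrix <code>q</code> are made-up numbers, used only to build the Metropolis-Hastings transition matrix and test detailed balance and stationarity with rational arithmetic.<br />

```python
from fractions import Fraction as F

# Assumed toy setup: unnormalised target f on 3 states, proposal q[x][y] = q(y | x).
f = [F(1), F(2), F(3)]
q = [[F(1, 2),  F(1, 4),  F(1, 4)],
     [F(1, 3),  F(1, 3),  F(1, 3)],
     [F(1, 10), F(3, 10), F(3, 5)]]

def r(x, y):
    """Acceptance probability r(x, y) of a proposed move from x to y."""
    return min(f[y] / f[x] * q[y][x] / q[x][y], F(1))

n = len(f)
P = [[F(0)] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if y != x:
            P[x][y] = q[x][y] * r(x, y)    # y is proposed and accepted
    P[x][x] = 1 - sum(P[x])                # stay put: rejection or y == x

# Detailed balance: f(x) P(x, y) = f(y) P(y, x) for every pair of states.
balanced = all(f[x] * P[x][y] == f[y] * P[y][x]
               for x in range(n) for y in range(n))

# Stationarity of the normalised target: pi = pi P.
total = sum(f)
pi = [v / total for v in f]
stationary = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)] == pi
print(balanced, stationary)
```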
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to help sample from probability distributions that are difficult to sample from. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (that is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although it is less widely applicable.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. Compute <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math> and accept <math>\displaystyle Y</math> with probability <math>\displaystyle r</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant (the form of the distribution is known, but the distribution is not exactly known).<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r(x_i,y) \\<br />
x_i, & \text{with probability } 1-r(x_i,y) \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
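<br />
A small experiment makes the dependence visible. In the Python sketch below the target (a standard normal, up to a constant) and the fixed proposal (a standard Cauchy, drawn by inverting its CDF) are illustrative assumptions, as are the function name, chain length and seed. Whenever a proposal is rejected the chain repeats its previous value, so consecutive values coincide with positive frequency, which would not happen for independent draws from a continuous distribution.<br />

```python
import math
import random

def independence_sampler(n, seed=0):
    """Independence Metropolis-Hastings with a fixed Cauchy proposal."""
    rng = random.Random(seed)

    def f(x):       # target, up to a constant: standard normal
        return math.exp(-0.5 * x * x)

    def q(x):       # proposal, up to a constant: standard Cauchy
        return 1.0 / (1.0 + x * x)

    x = 0.0
    chain = []
    for _ in range(n):
        y = math.tan(math.pi * (rng.random() - 0.5))   # Y ~ q, drawn without using x
        r = min(f(y) / f(x) * q(x) / q(y), 1.0)        # r still depends on x
        if rng.random() < r:
            x = y
        chain.append(x)
    return chain

chain = independence_sampler(50_000)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
# Fraction of steps where the chain repeated its previous value (a rejection).
repeats = sum(a == b for a, b in zip(chain, chain[1:])) / (len(chain) - 1)
print(mean, var, repeats)
```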
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
Simulated Annealing is a method of optimization and an application of the Metropolis Hastings Algorithm.<br />
<br />
Consider the problem where we want to find <math>x</math> such that the objective function <math>h(x)</math> is at its minimum,<br /><br /><br />
<br />
<math>\ \min_{x}(h(x)) </math><br /><br /><br />
<br />
Given a constant <math>T>0</math> and since the exponential function is monotone, this optimization problem is equivalent to,<br /><br /><br />
<br />
<math>\ \max_{x}(e^{\frac{-h(x)}{T}})</math> <br /><br /><br />
<br />
We consider a distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br /><br />IE. <math>X_{i+1} = Y</math> if <math>U<r</math><br /><br /><br />
:#Decrease T and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
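<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math> (but close to 1, because the exponent is close to 0 when T is large, so uphill moves are accepted often)<br />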
<br />
<b>On the other hand, suppose T is arbitrarily small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore almost always reject <math>\displaystyle y </math><br />
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it makes a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle \frac{f(x)}{T}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
<gallery widths="250px" heights="250px"><br />
Image:ezplotf1.jpg|T = 100<br />
Image:ezplotf2.jpg|T = 0.1<br />
</gallery><br />
<br />
<br />
As T becomes small, the region we are trying to sample from becomes sharper. The points that we accept are increasingly close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
In practice, the algorithm may get 'stuck' in another nearby local minimum when T is too small, and we do not get the convergence we are looking for.<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); <br />
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
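<br />
The same procedure can be sketched in Python with the whole temperature schedule in one loop. This is an illustrative port rather than the lecture's code: the names <code>anneal</code> and <code>steps_per_temperature</code>, the fixed seed, and the choice of 100 steps per temperature are assumptions. The acceptance test is done on the log scale, which is equivalent to <math>U < r</math> but avoids dividing two very small exponentials when T is tiny.<br />
<br />
```python
import math
import random

def obj(x):
    """Objective to minimise: f(x) = (x - 2)^2."""
    return (x - 2.0) ** 2

def anneal(temperatures, steps_per_temperature=100, b=2.0, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    for T in temperatures:                 # gradually decrease T
        for _ in range(steps_per_temperature):
            y = rng.gauss(x, b)            # proposal Y ~ N(x, b^2)
            # log r = (f(x) - f(y)) / T; accept with probability min(1, r).
            log_r = (obj(x) - obj(y)) / T
            if log_r >= 0 or rng.random() < math.exp(log_r):
                x = y                      # accept Y
    return x

schedule = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
x_final = anneal(schedule)
print(x_final)
```
<br />
With this schedule the final point should lie close to the minimiser <math>x = 2</math>; the exact value depends on the random seed.<br />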
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
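<br />
To make the cost function concrete, here is a small Python sketch that computes the length of a closed tour. The city coordinates, the use of Euclidean distance as the travel cost, and the function name <code>tour_length</code> are assumptions made for illustration only.<br />
<br />
```python
import math

def tour_length(cities, order):
    """Total length of the closed tour that visits the cities in the given order."""
    total = 0.0
    for k in range(len(order)):
        x1, y1 = cities[order[k]]
        x2, y2 = cities[order[(k + 1) % len(order)]]  # wrap around to the start
        total += math.hypot(x2 - x1, y2 - y1)
    return total

# Hypothetical example: four cities on the corners of a unit square.
cities = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(tour_length(cities, [0, 1, 2, 3]))  # tour around the perimeter
print(tour_length(cities, [0, 2, 1, 3]))  # tour that uses both diagonals
```
<br />
A simulated annealing approach would treat <code>order</code> as the state and <code>tour_length</code> as the function to minimise, for example by proposing a swap of two cities at each step.<br />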
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds, including mathematics, computer science, chemistry, physics, and psychology. Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience, to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, as Metropolis-Hastings does:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the one given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively strong correlations between the random variables, because the chain then moves through the distribution very slowly.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1^* \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2^* \sim~ f(x_2|x_1^*, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n^* \sim~ f(x_n|x_1^*, x_2^*, ... , x_{n-1}^*)<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1^*, x_2^*, \ldots, x_n^*) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually needs a long time before it converges, called the "burn-in time". For this reason the distribution is sampled better by using only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle x_2|x_1 \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle x_1|x_2 \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; %mean vector (mu_1, mu_2)<br />
rho = 0.01; %correlation between the two components<br />
sigma = sqrt(1 - rho^2); %conditional standard deviation<br />
X(1,:) = [0 0]; %initial point<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); %conditional mean of x1 given the previous x2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); %conditional mean of x2 given the NEW x1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots roughly the last 4000 points -> <br />
%this demonstrates that convergence occurs after a while <br />
%(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
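<br />
For comparison, the same sampler can be sketched in Python using only the standard library. This is an illustrative sketch, not the lecture's code: the function name, the fixed seed, and the value <math>\rho = 0.5</math> are assumptions. The second draw conditions on the freshly sampled first coordinate, as the procedure requires, and the first 1000 points are discarded as burn-in.<br />
<br />
```python
import math
import random

def gibbs_bivariate_normal(mu1, mu2, rho, n, seed=0):
    """Gibbs sampler for a bivariate normal with unit variances and correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)          # conditional standard deviation
    x1, x2 = 0.0, 0.0                       # initial point
    samples = []
    for _ in range(n):
        x1 = rng.gauss(mu1 + rho * (x2 - mu2), sd)  # draw x1 given current x2
        x2 = rng.gauss(mu2 + rho * (x1 - mu1), sd)  # draw x2 given the NEW x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(1.0, 2.0, rho=0.5, n=5000)
kept = samples[1000:]                       # discard the burn-in period
mean1 = sum(s[0] for s in kept) / len(kept)
mean2 = sum(s[1] for s in kept) / len(kept)
print(mean1, mean2)
```
<br />
The two sample means should be close to <math>\mu_1 = 1</math> and <math>\mu_2 = 2</math>.<br />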
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
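<br />
The procedure above can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the unnormalised target <code>log_f</code>, the step size, the seed, and the use of symmetric Gaussian random-walk proposals for both <math>q</math> and <math>\tilde{q}</math>. With symmetric proposals the ratio <math>q(X_n\mid Z)/q(Z\mid X_n)</math> equals 1, so it drops out of <math>r</math>.<br />
<br />
```python
import math
import random

def log_f(x, y):
    # Hypothetical unnormalised target density (on the log scale), for illustration only.
    return -0.5 * (x * x + y * y + x * x * y * y)

def mh_within_gibbs(n, step=1.0, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    chain = []
    for _ in range(n):
        # Metropolis-Hastings step for X, treating Y as fixed.
        z = rng.gauss(x, step)
        log_r = log_f(z, y) - log_f(x, y)      # symmetric proposal: q ratio is 1
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x = z
        # Metropolis-Hastings step for Y, treating the updated X as fixed.
        z = rng.gauss(y, step)
        log_r = log_f(x, z) - log_f(x, y)
        if log_r >= 0 or rng.random() < math.exp(log_r):
            y = z
        chain.append((x, y))
    return chain

chain = mh_within_gibbs(20000)
kept = chain[2000:]                            # discard burn-in
mean_x = sum(p[0] for p in kept) / len(kept)
mean_y = sum(p[1] for p in kept) / len(kept)
print(mean_x, mean_y)
```
<br />
Because this particular target is symmetric about the origin, both sample means should be close to 0.<br />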
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N</math> X <math>\displaystyle 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, only its scale needs to be fixed. Summing the defining equation over <math>\displaystyle i</math> (and assuming every page has at least one outgoing link) shows that a solution satisfies <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). We can use this to solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = \frac{1-d}{N}\times e \times e^T \times P + d \times L \times D^{-1} \times P \text{ by replacing 1 with } \frac{e^T \times P}{N} </math> <br />
<br />
<math>\displaystyle P = \left[\frac{1-d}{N} \times e \times e^T + d \times L \times D^{-1}\right] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i = A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n we have <math>\displaystyle P_n \approx P</math>.<br />
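<br />
The iteration can be sketched in Python as below. This is a minimal illustration under stated assumptions: the three-page link structure, the value <math>d = 0.85</math>, the iteration count, and the function name <code>pagerank</code> are all hypothetical. Rather than forming A explicitly, the sketch repeatedly applies the defining equation <math>P = (1-d) e + d L D^{-1} P</math> from the matrix form above.<br />
<br />
```python
def pagerank(L, d=0.85, iterations=200):
    """Iterate P <- (1 - d) e + d L D^{-1} P, where L[i][j] = 1 if page j links to page i."""
    n = len(L)
    # c[j] = number of outgoing links from page j (column sums of L).
    c = [sum(L[i][j] for i in range(n)) for j in range(n)]
    P = [1.0] * n                                  # initial guess
    for _ in range(iterations):
        new_P = []
        for i in range(n):
            incoming = sum(L[i][j] * P[j] / c[j] for j in range(n) if L[i][j])
            new_P.append((1 - d) + d * incoming)
        P = new_P
    return P

# Hypothetical web: page 0 links to pages 1 and 2, page 1 links to 2, page 2 links to 0.
L = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
P = pagerank(L)
print(P)
```
<br />
In this small example page 2, which receives links from both other pages, ends up with the highest rank, and page 1, whose only incoming link comes from a page with two outgoing links, ends up with the lowest.<br />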
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
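<br />
These definitions translate directly into code. The following Python sketch uses plain lists as vectors; the function names and the example vectors are assumptions chosen for illustration.<br />
<br />
```python
import math

def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(v):
    """Length of v: sqrt(v . v)."""
    return math.sqrt(dot(v, v))

def dist(u, v):
    """Distance between u and v: the length of u - v."""
    return norm([ui - vi for ui, vi in zip(u, v)])

def angle(u, v):
    """Angle theta with u . v = |u| |v| cos(theta); u and v must be non-zero."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding error

u = [3.0, 4.0]
v = [4.0, -3.0]
print(dot(u, v), norm(u), dist(u, v), angle(u, v))
```
<br />
Here the inner product is 0, so the two vectors are at a right angle to each other.<br />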
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0,\ \forall x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0,\ \forall x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), <math>A^{\dagger} = (A^TA)^{-1}A^T</math> with <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
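<br />
For a 2 by 2 matrix the characteristic equation is a quadratic, so the eigenvalues can be computed by hand or in a few lines of code. The Python sketch below is illustrative only; the function name and the example matrix (a real symmetric, positive-definite matrix) are assumptions.<br />
<br />
```python
import cmath

def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]]: roots of lambda^2 - (a + d) lambda + (ad - bc) = 0."""
    tr = a + d
    det = a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)   # complex sqrt also covers complex eigenvalues
    return (tr + root) / 2, (tr - root) / 2

# Hypothetical symmetric, positive-definite example: A = [[2, 1], [1, 3]].
l1, l2 = eigenvalues_2x2(2.0, 1.0, 1.0, 3.0)
print(l1, l2, l1 + l2)
```
<br />
Both eigenvalues are real and positive, as the properties above predict, and their sum equals the sum of the diagonal entries of the matrix (its trace).<br />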
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
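<br />
A 2 by 2 rotation matrix is a simple instance of an orthonormal transformation. The Python sketch below, in which the angle, the test vector, and the helper names are assumptions chosen for illustration, checks that the transformation preserves length and that the rows are orthonormal.<br />
<br />
```python
import math

def rotation(theta):
    """2 x 2 rotation matrix, an example of an orthonormal transformation."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def matvec(A, x):
    """Matrix-vector product y = A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

A = rotation(0.7)
x = [3.0, 4.0]
y = matvec(A, x)
row_dot = sum(a * b for a, b in zip(A[0], A[1]))   # inner product of the two rows
print(norm(x), norm(y), row_dot)
```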
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> obtained with the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
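<br />
Before working through the analytical solution, the constrained maximisation can be explored numerically. The Python sketch below does not use Lagrange multipliers; it swaps in power iteration, which repeatedly multiplies a unit vector by the sample covariance matrix and renormalises it so that <math>\textbf{w}^T\textbf{w} = 1</math>. The data set, the function names, and the iteration count are assumptions made for illustration.<br />
<br />
```python
import math

def sample_covariance(data):
    """Sample covariance matrix s of a list of d-dimensional points."""
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(s, iterations=200):
    """Approximate the unit vector w that maximises w^T s w, by power iteration."""
    d = len(s)
    w = [1.0 / math.sqrt(d)] * d                    # starting direction with |w| = 1
    for _ in range(iterations):
        v = [sum(s[i][j] * w[j] for j in range(d)) for i in range(d)]
        length = math.sqrt(sum(vi * vi for vi in v))
        w = [vi / length for vi in v]               # renormalise so that w^T w = 1
    variance = sum(w[i] * s[i][j] * w[j] for i in range(d) for j in range(d))
    return w, variance

# Hypothetical data lying roughly along the diagonal of the plane.
data = [(2.0, 1.9), (0.0, 0.1), (-1.0, -1.2), (1.0, 0.8), (-2.0, -1.6)]
s = sample_covariance(data)
w, variance = first_principal_component(s)
print(w, variance)
```
<br />
The variance along the direction found is at least as large as the variance along either coordinate axis, as expected for the direction of maximum variance.<br />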
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the constrained max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one gives the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
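<br />
As a check, the system can be solved by hand. Differentiating the Lagrangian <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> with respect to <math>\displaystyle x</math> and <math>\displaystyle y</math> and setting the results to zero gives<br />
<br />
<math>\displaystyle 1 - 2\lambda x = 0, \quad -1 - 2\lambda y = 0 \quad \Rightarrow \quad x = \frac{1}{2\lambda}, \quad y = -\frac{1}{2\lambda}</math><br />
<br />
Substituting into the constraint,<br />
<br />
<math>\displaystyle x^2 + y^2 = \frac{1}{4\lambda^2} + \frac{1}{4\lambda^2} = \frac{1}{2\lambda^2} = 1 \quad \Rightarrow \quad \lambda = \pm \frac{\sqrt{2}}{2}</math><br />
<br />
With <math>\displaystyle \lambda = \sqrt{2}/2</math> we get <math>\displaystyle (x,y) = (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle f = \sqrt{2}</math>; with <math>\displaystyle \lambda = -\sqrt{2}/2</math> we get <math>\displaystyle (x,y) = (-\sqrt{2}/2,\sqrt{2}/2)</math> and <math>\displaystyle f = -\sqrt{2}</math>. The first point is therefore the constrained maximum and the second is the constrained minimum.<br />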
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} = 1 </math> and the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is non-zero. Setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = 0</math> recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of <math>\displaystyle L(\textbf{w},\lambda)</math> satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
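<br />
The eigenvalue result can be checked numerically. The following is a minimal Matlab sketch, not code from the lecture: the data matrix X is made up purely for illustration. It forms the sample covariance S, takes its eigenvectors, and confirms that the variance along the leading eigenvector equals the largest eigenvalue and that the eigenvalues sum to Tr(S).<br />
<br />
 X = [2 0 1 3 4; 1 1 0 2 3];               % made-up data: D = 2 dimensions, n = 5 points (one per column)<br />
 n = size(X,2);<br />
 Xc = X - mean(X,2)*ones(1,n);             % subtract the mean of each dimension<br />
 S = Xc*Xc'/(n-1);                         % sample covariance matrix (D x D)<br />
 [V, L] = eig(S);                          % columns of V are eigenvectors of S<br />
 [lambda, idx] = sort(diag(L), 'descend'); % order the eigenvalues from largest to smallest<br />
 V = V(:, idx);                            % reorder the eigenvectors to match<br />
 w = V(:,1);                               % first principal direction (unit length)<br />
 var_along_w = w'*S*w                      % agrees with lambda(1) up to rounding<br />
 total_var = sum(lambda)                   % agrees with trace(S) up to rounding<br />
<br />
Sorting is done explicitly because <code>eig</code> is not guaranteed to return the eigenvalues in descending order.<br />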
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat') % adjust the path to wherever noisy.mat is stored<br />
who % list the variables that were loaded<br />
size(X) % each column of X holds one image of 20*28 = 560 pixels<br />
imagesc(reshape(X(:,1),20,28)') % display the first noisy image<br />
colormap gray<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images. The first one represents the noisy data; we can barely make out the face.<br />
<br />
The second one is the denoised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this. We can classify the data (i.e. give each data point a label and compare the differently labelled types) or cluster it (i.e. leave the data unlabelled and look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s from the same data set we have been using throughout the course. We will apply PCA to the data and compare the images of the labeled data. This is an example of classification.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
The matlab code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs , scores ] = princomp (X') <br />
% performs principal components analysis on the data matrix X<br />
% returns the principal component coefficients and scores<br />
% scores is the low dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
[[File:Plot1.jpg]]<br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
[[File:Plot2.jpg]]<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
% this adds the data for the 3s<br />
% the 'ro' command makes them red Os on the plot<br />
[[File:Plot3.jpg]]<br />
% If we classify based on the position in this plot (feature), <br />
% it's easier than looking at each of the 64 data pieces<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
[[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
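<br />
The centring, encoding and reconstruction steps in the notes above can be sketched in Matlab as follows. This is only an illustrative sketch: the data matrix X and the choice d = 2 are made up, and the mean is added back after reconstruction so that the result can be compared with the original data.<br />
<br />
 X = [2 0 1 3 4; 1 1 0 2 3; 0 2 1 1 5];  % made-up data: D = 3, n = 5 (one point per column)<br />
 n = size(X,2);<br />
 mu = mean(X,2);                         % mean of each dimension<br />
 Xc = X - mu*ones(1,n);                  % centre the data so that its mean is 0<br />
 [U, S, V] = svd(Xc);                    % columns of U are the principal directions<br />
 d = 2;                                  % number of components to keep (assumed)<br />
 Ud = U(:,1:d);                          % D x d matrix of the top d directions<br />
 Y = Ud'*Xc;                             % encoding: d x n<br />
 Xhat = Ud*Y + mu*ones(1,n);             % reconstruction: D x n, mean added back<br />
 err = norm(X - Xhat,'fro')^2            % squared reconstruction error<br />
<br />
The reconstruction error equals the sum of the squared singular values that were discarded, which is why dropping the low-variance directions loses little information.<br />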
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to, depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation between the classes and to minimize the within-class variance. That is, our ideal situation is that the individual classes are as far away from each other as possible, while the data within each class are close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2 </math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within classes covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar. Therefore,<br />
<br /><br /><br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two class problem the between class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a well-known problem called "the generalized eigenvalue problem". We can solve it using Lagrange multipliers. Since w only specifies a direction, its length does not matter: the ratio above is unchanged when w is rescaled. Therefore we solve a problem similar to that in PCA,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br /><br />
subject to <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br /><br /><br />
We solve the following Lagrange Multiplier problem,<br />
<br /><br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br /><br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, the optimization problem for FDA is:<br />
<br />
<math><br />
\max_{w} \frac{w^T s_B w}{w^T s_w w}<br />
</math><br />
<br />
which we turned into<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />Subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
where <math>s_B</math> is the covariance matrix between classes and <math>s_w</math> is the covariance matrix within classes.<br />
<br />
Using Lagrange multipliers, we look for a stationary point of the Lagrangian: <br />
<math>\displaystyle (w^Ts_Bw) - \lambda \cdot [(w^Ts_ww)-1] </math><br />
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified, such that:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
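<br />
These three statements follow from setting the derivative of the Lagrangian to zero. The short derivation below uses only the quantities already defined and assumes that <math>\displaystyle s_w</math> is invertible:<br />
<br />
<math>\displaystyle \frac{\partial L}{\partial w} = 2s_Bw - 2\lambda s_ww = 0 \quad \Rightarrow \quad s_Bw = \lambda s_ww \quad \Rightarrow \quad s_w^{-1}s_Bw = \lambda w</math><br />
<br />
so <math>\displaystyle w</math> is an eigenvector of <math>\displaystyle s_w^{-1}s_B</math>. Multiplying <math>\displaystyle s_Bw = \lambda s_ww</math> on the left by <math>\displaystyle w^T</math> and using the constraint <math>\displaystyle w^Ts_ww = 1</math> gives <math>\displaystyle w^Ts_Bw = \lambda</math>, so the objective is largest for the eigenvector with the largest eigenvalue.<br />
<br />
In the two-class case, <math>\displaystyle s_Bw = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw</math>, and <math>\displaystyle (\mu_1 - \mu_2)^Tw</math> is a scalar, so <math>\displaystyle s_Bw</math> always points along <math>\displaystyle \mu_1 - \mu_2</math>. Hence <math>\displaystyle w \propto s_w^{-1}(\mu_1 - \mu_2)</math>, which is the same direction as <math>\displaystyle s_w^{-1}(\mu_2 - \mu_1)</math> up to sign.<br />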
<br />
=====Example:=====<br />
<br />
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data sets:<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300) <br />
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300) <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
X=[X1;X2] % stack the two samples vertically: 600 x 2<br />
X=X' % one data point per column: 2 x 600<br />
[coefs, scores]=princomp(X');<br />
coefs(:,1) %first principal component<br />
<br />
ans =<br />
0.76355476446932<br />
0.64574307712603<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
%We see in the above picture that there is no overlap<br />
Xp=coefs(:,1)'*X<br />
figure<br />
plot(Xp(1:300),1,'ob')<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
%In this case there is overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance we could be wishing to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (When Y is a continuous variable, the prediction problem is one of regression rather than classification.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we find the empirical error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
<br />
Error rate is also called misclassification rate, and 1 minus the error rate is sometimes called the classification rate.<br />
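<br />
In Matlab the empirical error rate is a one-line computation. The sketch below uses made-up label vectors (y for the true labels and yhat for the predictions of some classifier h), so the numbers are illustrative only.<br />
<br />
 y    = [0 0 1 1 1 0 1 0];        % true labels of N = 8 test points (made up)<br />
 yhat = [0 1 1 1 0 0 1 0];        % predicted labels h(x_i) (made up)<br />
 err  = mean(yhat ~= y)           % fraction misclassified: 2 of 8 points, i.e. 0.25<br />
 rate = 1 - err                   % classification rate: 0.75<br />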
<br />
=== Bayes Classification Rule ===<br />
<br />
Considering the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math><br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator can be written as <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classifier function<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & o/w\\<br />
\end{cases}</math><br />
<br />
This function is considered to be the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>. A real-life interpretation of the posterior <math>\, Pr(Y=1 \mid X=x)</math> may even deal with patterns and meaning, which provides an extra challenge in finding a mathematical interpretation.<br />
[[Image:Set Differentiation.jpg|right]]<br />
In the image, which set would it be more appropriate for the question mark to belong to?<br />
<br />
<br />
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.<br />
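<br />
To make the rule concrete, the following Matlab sketch evaluates r(x) and h(x) in a situation where the distributions are known. Everything in it is an assumption made for illustration: the two classes are one-dimensional Gaussians with means 0 and 2, a common standard deviation of 1, and equal priors.<br />
<br />
 mu0 = 0; mu1 = 2; sigma = 1;                              % assumed class means and common standard deviation<br />
 p1 = 0.5; p0 = 1 - p1;                                    % assumed priors P(Y=1) and P(Y=0)<br />
 npdf = @(x,m,s) exp(-(x-m).^2/(2*s^2))/(s*sqrt(2*pi));    % Gaussian density, written out by hand<br />
 r = @(x) npdf(x,mu1,sigma)*p1 ./ (npdf(x,mu1,sigma)*p1 + npdf(x,mu0,sigma)*p0);  % posterior P(Y=1 | X=x)<br />
 h = @(x) double(r(x) >= 0.5);                             % Bayes classification rule<br />
 r([0.5 1.5])                                              % posterior at two test points<br />
 h([0.5 1.5])                                              % x = 0.5 goes to class 0, x = 1.5 to class 1<br />
<br />
With equal priors and equal variances the posterior crosses 1/2 at the midpoint between the two means, so the rule reduces to assigning x to the class with the closer mean.<br />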
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, the decision boundary consists of those points at which the probabilities of belonging to the two classes are identical. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a decision boundary represented by a linear function, while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
<br />
We would like to apply the Bayes classification rule by approximating the class conditional density <br />
<math>\, f_k(x) </math> and the prior <math>\, \pi_k </math>:<br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l}f_l(x)\pi_l}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional density is multivariate Gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>\, (x + a)^TA(x+b) = x^TAx + a^TAb + x^TAb + a^TAx </math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\, \Sigma_k </math> is the class covariance matrix and <math>\, \mu_k </math> is the class mean. By definition of the decision boundary (decision boundary between class <math>\ k </math> and class <math>\ l </math>),<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is a linear function of <math>\ x </math> of the form <math>\, x^Ta + b = 0 </math>.<br />
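<br />
Carrying the <math>\, \Sigma^{-1} </math> factor through the cross terms and matching against <math>\, x^Ta + b = 0 </math>, the coefficients are<br />
<br />
<math>\, a = \Sigma^{-1}(\mu_k - \mu_l), \qquad b = \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} </math><br />
<br />
As an illustration with assumed values (the means and covariance of the FDA example, with equal priors), take <math>\, \mu_k = (1,1)^T </math>, <math>\, \mu_l = (5,3)^T </math> and <math>\, \Sigma = \begin{pmatrix} 1 & 1.5 \\ 1.5 & 3 \end{pmatrix} </math>, so that <math>\, \Sigma^{-1} = \begin{pmatrix} 4 & -2 \\ -2 & 4/3 \end{pmatrix} </math>. Then<br />
<br />
<math>\, a = \Sigma^{-1}\begin{pmatrix} -4 \\ -2 \end{pmatrix} = \begin{pmatrix} -12 \\ 16/3 \end{pmatrix}, \qquad b = \frac{1}{2}\begin{pmatrix} 4 & 2 \end{pmatrix}\Sigma^{-1}\begin{pmatrix} 6 \\ 4 \end{pmatrix} + \log 1 = \frac{76}{3} </math><br />
<br />
and the boundary <math>\, -12x_1 + \frac{16}{3}x_2 + \frac{76}{3} = 0 </math> simplifies to <math>\, 9x_1 - 4x_2 = 19 </math>. With equal priors it passes through the midpoint <math>\, (3,2) </math> of the two means, as expected.<br />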
<br />
=====Example:=====<br />
<br />
%Load data set<br />
load 2_3;<br />
[coefs, scores] = princomp(X');<br />
size(X)<br />
% ans = 64 400 % 64 dimensions (pixels), 400 data points<br />
size(coefs)<br />
% ans = 64 64 <br />
size(scores)<br />
% ans = 400 64<br />
Y=scores(:, 1:2);<br />
% just use two of the 64 principal components<br />
plot(Y(1:200, 1),Y(1:200, 2), 'b.')<br />
hold on<br />
plot(Y(201:400, 1),Y(201:400, 2), 'r.')<br />
<br />
[[File:Pca_2.jpg]]<br />
<br />
ll=[zeros(200,1);ones(200,1)]; % class labels as one 400 x 1 vector: 0 for the 2s, 1 for the 3s<br />
[C,err,P,logp,coeff] = classify(Y, Y, ll, 'linear');<br />
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|) </math><br />
<br />
<br />
<br />
To classify a point, <math>\ x </math>, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class <math>\ k </math> if <math>\, \delta_k > \delta_l </math> and vice versa. <br /><br /><br />
<br />
:<math><br />
h(x) = \begin{cases}<br />
k, & \text{if } \delta_k > \delta_l \\<br />
l, & otherwise\\<br />
\end{cases}</math><br />
<br />
(note: since <math> - \frac{d}{2}log(2\pi) </math> is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.) <br /><br /><br />
<br />
'''Special Case: <math>\, \Sigma_k = I </math>''', the identity matrix. Then,<br />
<br />
<math> \,\delta_k = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) - \frac{1}{2}log(|I|) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) </math><br />
<br />
We see that in the case <math>\, \Sigma = I </math>, we can simply classify a point, <math>\ x </math>, to a class based on the distances between <math>\ x </math> and the mean of the different classes (adjusted with the log of the prior). <br /><br /><br />
<br />
'''General Case: <math>\, \Sigma_k \ne I </math>''' <br /><br /><br />
<br />
<math> \, \Sigma_k = USV^T = USU^T </math> (since <math>\ \Sigma </math> is symmetric)<br /><br /><br />
<math> \, \Sigma_k^{-1} = (USU^T)^{-1} = (U^T)^{-1}S^{-1}U^{-1} = US^{-1}U^T </math><br /><br /><br />
<br />
So, <math> (x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^T US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-1}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k)^T I(S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k) </math><br />
:<math> \, = (x^* - \mu_k^*)^T I(x^* - \mu_k^*) </math> <br /><br /><br />
<br />
where <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> and <math> \mu_k^* = S^{-\frac{1}{2}}U^T\mu_k </math><br />
<br />
Hence the approach taken should be to transform point <math>x</math> from the beginning,<br />
<br />
i.e. Let <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> <br /><br /><br />
<br />
Then compute <math> \, \delta_k </math> and <math> \, \delta_l </math> with <math>x^*</math>, similar to the special case above. <br /><br /><br />
<br />
If the prior probabilities of the 2 classes are the same, then this method only requires us to find the distances from the transformed point <math>x^*</math> to the transformed means of the 2 classes. We would classify x to the class whose mean is closest.<br />
<br />
In the <math>\delta</math> function calculations above, <math>\pi_k = P(Y = k)</math> and <math>\pi_l = P(Y = l)</math>, and can be approximated using the proportions of <math>k</math> and <math>l</math> elements in the training set.<br />
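<br />
The transformation approach can be sketched in Matlab as follows. The numbers are assumptions chosen for illustration (a common covariance matrix, two class means, equal priors and one test point); the terms that are identical for both classes are dropped from the deltas because they cancel in the comparison.<br />
<br />
 Sigma = [1 1.5; 1.5 3];                 % assumed common covariance matrix<br />
 mu_k = [1;1]; mu_l = [5;3];             % assumed class means<br />
 pi_k = 0.5; pi_l = 0.5;                 % assumed priors<br />
 x = [2;1];                              % point to classify<br />
 [U, S, V] = svd(Sigma);                 % Sigma = U*S*U' since Sigma is symmetric<br />
 T = diag(1./sqrt(diag(S)))*U';          % the transformation S^(-1/2)*U'<br />
 xs = T*x; mks = T*mu_k; mls = T*mu_l;   % transformed point and means<br />
 delta_k = log(pi_k) - 0.5*(xs-mks)'*(xs-mks);<br />
 delta_l = log(pi_l) - 0.5*(xs-mls)'*(xs-mls);<br />
 if delta_k > delta_l<br />
     label = 'k'<br />
 else<br />
     label = 'l'<br />
 end<br />
<br />
After the transformation the quadratic form is an ordinary squared Euclidean distance, so the same code works for any positive definite common covariance matrix.<br />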
<br />
== More on Quadratic Discriminant Analysis - July 28 ==</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3670stat341 / CM 3612009-07-29T23:20:11Z<p>Hclam: /* Acceptance/Rejection Method */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. A good way to generate random numbers in computational statistics involves analyzing various distributions using computational methods. <s>As a result, the probability distribution of each possible number appears to be uniform</s> (pseudo-random). Outside a computational setting, presenting a uniform distribution is fairly easy (for example, rolling a fair die repetitively to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
In an ideal situation, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, Park and Miller found in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
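<br />
The recurrence is straightforward to implement. The Matlab sketch below is illustrative only: it reuses the parameters of the first example above (''a'' = 13, ''b'' = 0, ''m'' = 31, seed 1), and the variable names are our own.<br />
<br />
 a = 13; b = 0; m = 31;              % parameters from the first example above<br />
 x = zeros(1,4);<br />
 x(1) = 1;                           % seed x_0<br />
 for i = 1:3<br />
     x(i+1) = mod(a*x(i) + b, m);    % x_{i+1} = (a*x_i + b) mod m<br />
 end<br />
 x                                   % 1 13 14 27, matching the values computed by hand<br />
 u = x/(m-1)                         % scaled to lie between 0 and 1<br />
<br />
For large parameters such as ''a'' = 7<sup>5</sup> and ''m'' = 2<sup>31</sup> - 1, the product ''a x<sub>i</sub>'' stays below 2<sup>53</sup>, so it is still represented exactly in double precision.<br />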
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that applying the inverse of the cumulative distribution function (cdf) of some distribution to a random sample from the uniform distribution yields a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, strictly increasing, and is the cumulative distribution function (cdf) of some distribution, then the cdf of the random variable X is F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as Unif[0, 1], we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
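<br />
The two steps can also be sketched in Python (used here instead of Matlab). This is a minimal illustration; the function name <code>sample_exponential</code> and the choice of passing the random generator as an argument are assumptions of this sketch.<br />
<br />
```python
import math
import random


def sample_exponential(theta, n, rng):
    """Draw n values from the exponential distribution with rate theta
    using the inverse transform x = -log(1 - u) / theta."""
    samples = []
    for _ in range(n):
        u = rng.random()  # u lies in [0, 1)
        # 1 - u lies in (0, 1], so the logarithm is always defined.
        samples.append(-math.log(1.0 - u) / theta)
    return samples


rng = random.Random(0)
xs = sample_exponential(theta=2.0, n=100_000, rng=rng)
print(sum(xs) / len(xs))  # the sample mean should be close to 1/theta = 0.5
```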
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
<br />
<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math>. Further, for some distributions <math>F(x)</math> itself has no closed-form expression, and even when one exists, it is usually difficult to find <math>F^{-1}</math>.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (less-than-or-equal-to signs are used on the upper bounds so that the case <math>U = 1</math> is not missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
x=zeros(1,1000); % preallocate the sample vector<br />
for i=1:1000<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
elseif u <= sum(p(1:2)) % cumulative probability P(X <= 1) = 0.5<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
<br />
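The discrete procedure generalizes to any finite list of values and probabilities by walking through the cumulative sums. The following Python sketch (Python rather than Matlab; the function name <code>sample_discrete</code> is an assumption of this sketch) returns the first value whose cumulative probability is at least <math>U</math>.<br />
<br />
```python
import random
from itertools import accumulate


def sample_discrete(values, probs, rng):
    """Return values[k] for the smallest k with u <= p_0 + ... + p_k."""
    u = rng.random()
    for value, cumulative in zip(values, accumulate(probs)):
        if u <= cumulative:
            return value
    # Only reached if rounding makes the cumulative sum fall slightly below 1.
    return values[-1]


rng = random.Random(0)
draws = [sample_discrete([0, 1, 2], [0.3, 0.2, 0.5], rng) for _ in range(10_000)]
# The relative frequencies should be close to 0.3, 0.2 and 0.5.
print([draws.count(v) / len(draws) for v in (0, 1, 2)])
```
<br />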
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult to sample from directly.<br />
<br />
Let <math>g(x)</math> be a distribution that is easy to sample from and satisfies the condition: <br /><br /><br />
<br />
<math>\forall x: f(x) \leq c \cdot g(x)\ </math>, where <math> c \in \Re^+</math><br />
<br /><br /><br />
[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Since <math>c \cdot g(x) \geq f(x)</math> for all x, it is possible to obtain samples that follow f(x) by rejecting a proportion of the samples drawn from g(x).<br /><br /><br />
<br />
More specifically, by accepting samples drawn from g(x) with probability <math> \frac {f(x)}{c \cdot g(x)} </math>, we can obtain samples that follow f(x).<br />
<br /><br /><br />
<br />
We are more likely to reject a sample Y if the ratio above is much smaller than 1, which happens when <math>f(Y)</math> lies far below <math>cg(Y)</math>. Conversely, we accept the sample point with probability 1 if <math>f(x)</math> and <math>cg(x)</math> agree when evaluated at the sampled point. This process of accepting and rejecting points will yield a sample distribution that follows the target distribution <math>f(x)</math>.<br />
<br />
<br />
<br />
In the graph,<br /><br />
a draw of y = 7 from <math>g(x)</math>, where <math>f(x)</math> and <math>cg(x)</math> coincide, will be accepted with probability 1.<br />
<br />
A draw of y = 9 from <math>g(x)</math>, where <math>f(x)</math> lies far below <math>cg(x)</math>, will be accepted with a low probability.<br />
<br />
'''Proof'''<br />
Show that if points are sampled according to the Acceptance/Rejection method then they follow the target distribution.<br /><br /><br />
<br />
<math> P(X=x|accept) = \frac{P(accept|X=x)P(X=x)}{P(accept)}</math><br /> <br />
by Bayes' theorem<br />
<br /><br /><br />
<math>\begin{align} &P(accept|X=x) = \frac{f(x)}{c \cdot g(x)}\\ &P(X=x) = g(x) \end{align}</math><br /><br />by hypothesis.<br /><br /><br />
<br />
Then,<br /><br />
<math>\begin{align} P(accept) &= \int^{}_x P(accept|X=x)P(X=x) dx \\<br />
&= \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx\\<br />
&= \frac{1}{c} \int^{}_x f(x) dx\\<br />
&= \frac{1}{c} \end{align} </math><br />
<br /><br /><br />
Therefore,<br /><br />
<math> P(X=x|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)}g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> f(x) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
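<math>c</math> was chosen to be 2 in this example since <math> \max_x \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. In general, <math> c = \max_x \left(\frac{f(x)}{g(x)}\right) </math> is the smallest constant that satisfies <math> f(x) \leq c \cdot g(x) </math>, and since the probability of acceptance is <math>\frac{1}{c}</math>, it is also the choice that rejects the fewest samples.<br />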
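<br />
The procedure can be written once for any target and proposal. The following Python sketch (Python rather than Matlab) is a minimal illustration; the function name <code>accept_reject</code> and its arguments are assumptions of this sketch. It is applied to the Beta(2,1) example with a Unif[0,1] proposal and c = 2, and it also counts the proposals so that the acceptance rate can be compared with <math>\frac{1}{c}</math>.<br />
<br />
```python
import random


def accept_reject(f, g_pdf, g_sample, c, n, rng):
    """Collect n draws from density f using proposal g and constant c,
    where f(x) <= c * g(x) is assumed to hold for all x.
    Returns the accepted draws and the total number of proposals."""
    accepted = []
    proposals = 0
    while len(accepted) < n:
        y = g_sample(rng)  # step 1: Y ~ g
        u = rng.random()   # step 2: U ~ Unif[0, 1]
        proposals += 1
        if u <= f(y) / (c * g_pdf(y)):  # step 3: accept with prob f(Y)/(c g(Y))
            accepted.append(y)
    return accepted, proposals


rng = random.Random(1)
xs, tries = accept_reject(
    f=lambda x: 2.0 * x,            # Beta(2,1) density on [0, 1]
    g_pdf=lambda x: 1.0,            # Unif[0, 1] density
    g_sample=lambda r: r.random(),  # draw from Unif[0, 1]
    c=2.0,
    n=20_000,
    rng=rng,
)
print(sum(xs) / len(xs))  # should be close to the Beta(2,1) mean 2/3
print(len(xs) / tries)    # should be close to 1/c = 0.5
```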
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
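i = 0; % number of accepted samples so far<br />
j = 1; % index of the next accepted sample<br />
while i < 1000<br />
y = rand; % proposal Y ~ Unif[0,1]<br />
u = rand;<br />
if u <= y % accept with probability f(y)/(c*g(y)) = y<br />
x(j) = y;<br />
j = j + 1;<br />
i = i + 1;<br />
end<br />
end<br />
hist(x);<br />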
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>, which is the pdf of a Beta(2,1) distribution.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
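<br />
A minimal Python sketch of this discrete example follows (Python rather than Matlab). The names <code>PMF</code> and <code>sample_pmf_by_rejection</code> are assumptions of this sketch. The constant c is computed from the pmf, which gives 2.2 for the values above.<br />
<br />
```python
import random

# Target pmf from the example above.
PMF = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}


def sample_pmf_by_rejection(pmf, rng):
    """Draw one value from pmf using a discrete uniform proposal on its support."""
    support = list(pmf)
    g = 1.0 / len(support)        # proposal probability of each value
    c = max(pmf.values()) / g     # smallest c with f(x) <= c * g(x)
    while True:
        y = rng.choice(support)   # step 1: Y ~ Unif{support}
        u = rng.random()          # step 2: U ~ Unif[0, 1]
        if u <= pmf[y] / (c * g): # step 3: accept with prob f(Y)/(c g(Y))
            return y


rng = random.Random(0)
draws = [sample_pmf_by_rejection(PMF, rng) for _ in range(10_000)]
# The relative frequencies should be close to the target pmf.
print({k: draws.count(k) / len(draws) for k in PMF})
```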
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before obtaining a sample of the desired size. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, both the Inverse Method (the inverse function is difficult to compute) and the Acceptance/Rejection Method are difficult to apply.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math>, i.e. mean <math> \frac{1}{\lambda} </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
This suggests that to sample from the Gamma distribution, we can draw t independent exponential random variables with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method, applied to independent uniform samples as seen before, to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
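<br />
The same procedure can be sketched in Python (used here instead of Matlab) for a general rate <math>\lambda</math>. The function name <code>sample_gamma_integer_shape</code> is an assumption of this sketch, and the method only applies when the shape t is a positive integer.<br />
<br />
```python
import math
import random


def sample_gamma_integer_shape(t, lam, rng):
    """Draw one Gamma(t, lam) value (integer shape t, rate lam) as
    -log(u_1 * ... * u_t) / lam, the sum of t Exp(lam) draws."""
    product = 1.0
    for _ in range(t):
        # 1 - random() lies in (0, 1], which keeps the logarithm defined.
        product *= 1.0 - rng.random()
    # For very large t the product can underflow; summing logs avoids that.
    return -math.log(product) / lam


rng = random.Random(0)
xs = [sample_gamma_integer_shape(t=5, lam=2.0, rng=rng) for _ in range(20_000)]
print(sum(xs) / len(xs))  # should be close to the mean t/lam = 2.5
```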
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far, e.g. the inverse is not easy to obtain from F(Z). We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point on the Cartesian plane, r is the distance from that point to the pole, and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
Also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint density of the independent variables X and Y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint density of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
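<br />
For comparison, the three steps of the procedure can be sketched in Python (used here instead of Matlab). The function name <code>box_muller</code> is an assumption of this sketch. Each call returns one pair (x, y), and the sample moments give a rough check that the pair behaves like two independent standard normals.<br />
<br />
```python
import math
import random


def box_muller(rng):
    """Return one pair of independent N(0, 1) draws via the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is defined
    u2 = rng.random()
    d = -2.0 * math.log(u1)       # d = r^2 ~ Exp(1/2)
    theta = 2.0 * math.pi * u2    # theta ~ Unif[0, 2*pi)
    r = math.sqrt(d)
    return r * math.cos(theta), r * math.sin(theta)


rng = random.Random(0)
pairs = [box_muller(rng) for _ in range(20_000)]
xs = [p[0] for p in pairs]
mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / (len(xs) - 1)
print(mean_x, var_x)  # should be close to 0 and 1
```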
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \overset{\text{iid}}{\sim} N(0,1) \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (called u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives the upper triangular factor <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>, satisfying <math>ss^T ss = \Sigma</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
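<br />
The property <math>\vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}</math> can also be sketched without a matrix library. The following Python code (Python rather than Matlab) is a minimal illustration; the function names <code>cholesky_lower</code> and <code>sample_mvn</code> are assumptions of this sketch. It uses the lower triangular factor L with <math>LL^T = \Sigma</math>, which is the transpose of the upper triangular factor returned by Matlab's chol.<br />
<br />
```python
import math
import random


def cholesky_lower(sigma):
    """Return lower triangular L with L * L^T = sigma.
    sigma is assumed to be symmetric and positive definite."""
    d = len(sigma)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def sample_mvn(mu, L, rng):
    """Return one draw of X = mu + L z, where z has independent N(0, 1) entries."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]


mu = [-2.0, 3.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]
L = cholesky_lower(sigma)
rng = random.Random(0)
X = [sample_mvn(mu, L, rng) for _ in range(15_000)]
print(L)  # the second row should be about [0.5, 0.866]
```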
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
In this lecture we will continue to discuss sampling from specific distributions, introduce '''Monte Carlo Integration''', and also talk about the differences between the Bayesian and Frequentist views on probability, along with references to '''Bayesian Inference'''.<br />
<br />
====Binomial Distribution====<br />
A Binomial random variable <math>X \sim~ Bin(n,p) </math> is the number of successes in <math>n</math> independent Bernoulli trials, each with probability of success <math>p</math> <math>(0 \leq p \leq 1)</math>. For each trial we generate an independent uniform random variable: <math>U_1, \ldots, U_n \sim~ Unif(0,1)</math>. Then X is the number of times that <math>U_i \leq p</math>. If n is large enough, by the central limit theorem, the Normal distribution can be used to approximate the Binomial distribution.<br />
<br />
Sampling from Binomial distribution in Matlab is done using the following code:<br />
n=3;<br />
p=0.5;<br />
trials=1000;<br />
X=sum((rand(trials,n))'<=p);<br />
hist(X)<br />
<br />
The histogram approximates the Binomial distribution, and for larger <math>n</math>, it would resemble a Normal distribution.<br />
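<br />
A minimal Python sketch of the same idea (Python rather than Matlab; the function name <code>sample_binomial</code> is an assumption of this sketch) counts how many of n uniform draws fall at or below p.<br />
<br />
```python
import random


def sample_binomial(n, p, rng):
    """Return one Bin(n, p) draw as the number of uniform draws that are <= p."""
    return sum(1 for _ in range(n) if rng.random() <= p)


rng = random.Random(0)
xs = [sample_binomial(n=3, p=0.5, rng=rng) for _ in range(1_000)]
print(sum(xs) / len(xs))  # should be close to the mean n*p = 1.5
```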
<br />
====Monte Carlo Integration====<br />
<br />
Monte Carlo Integration is a numerical method of approximating the evaluation of integrals using random numbers generated from simulations. In this course we will mainly look at three methods for approximating integrals:<br />
# Basic Monte Carlo Integration<br />
# Importance Sampling<br />
# Markov Chain Monte Carlo (MCMC)<br />
<br />
====Bayesian VS Frequentists====<br />
<br />
Over the history of statistics, two major schools of thought have emerged and have been locked in an ongoing debate over which one has the correct view on probability. These two schools are known as the Bayesian and Frequentist schools of thought. The Bayesians and the Frequentists hold different philosophical views on what defines probability. Below are some fundamental differences between the Bayesian and Frequentist schools of thought:<br />
<br />
'''Frequentist'''<br />
*Probability is '''objective''' and refers to the limit of an event's relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads 50% of the time in the long run.<br />
*Parameters are all fixed and unknown constants.<br />
*Any statistical process only has interpretations based on limiting frequencies. For example, a 95% C.I. of a given parameter will contain the true value of the parameter 95% of the time.<br />
<br />
'''Bayesian'''<br />
*Probability is '''subjective''' and can be applied to single events based on degree of confidence or beliefs. For example, a Bayesian can refer to tomorrow's weather as having a 50% chance of rain, whereas this would not make sense to a Frequentist because tomorrow is just one unique event, and cannot be referred to as a relative frequency in a large number of trials.<br />
*Parameters are random variables that have a given distribution, and other probability statements can be made about them.<br />
*Probability has a distribution over the parameters, and point estimates are usually done by either taking the mode or the mean of the distribution. <br />
<br />
====Bayesian Inference====<br />
<br />
'''Example''':<br />
If we have a screen that only displays single digits from 0 to 9, and this screen is split into a 4x5 matrix of pixels, then all together the 20 pixels that make up the screen can be referred to as <math>\vec{X}</math>, which is our data, and the parameter of the data for this case, which we will refer to as <math> \theta </math>, would be a discrete random variable that can take on the values of 0 to 9. In this example, a Bayesian would be interested in finding <math> Pr(\theta=a|\vec{X}=\vec{x})</math>, whereas a Frequentist would be more interested in finding <math> Pr(\vec{X}=\vec{x}|\theta=a)</math><br />
<br />
=====Bayes' Rule=====<br />
<br />
:<math>f(\theta|X) = \frac{f(X | \theta)\, f(\theta)}{f(X)}.</math><br />
<br />
Note: In this case <math>f (\theta|X)</math> is referred to as '''posterior''', <math>f (X | \theta)</math> as '''likelihood''', <math>f (\theta)</math> as '''prior''', and <math>f (X)</math> as the '''marginal''', where <math>\theta</math> is the parameter and <math>X</math> is the observed variable.<br />
<br />
'''Procedure in Bayesian Inference'''<br />
*First choose a probability distribution as the prior, which represents our beliefs about the parameters.<br />
*Then choose a probability distribution for the likelihood, which represents our beliefs about the data.<br />
*Lastly compute the posterior, which represents an update of our beliefs about the parameters after having observed the data.<br />
<br />
As mentioned before, for a Bayesian, finding point estimates usually involves finding the mode or the mean of the parameter's distribution.<br />
<br />
'''Methods'''<br />
*Mode: <math>\theta = \arg\max_{\theta} f(\theta|X) \gets</math> value of <math>\theta</math> that maximizes <math>f(\theta|X)</math><br />
*Mean: <math> \bar\theta = \int^{}_\theta \theta \cdot f(\theta|X)d\theta</math><br />
<br />
If it is the case that <math>\theta</math> is high-dimensional, and we are only interested in one of the components of <math>\theta</math>, for example, we want <math>\theta_1</math> from <math> \vec{\theta}=(\theta_1,\dots,\theta_n)</math>, then we would have to calculate the integral: <math>\int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_n </math><br />
<br />
This sort of calculation is usually very difficult or not feasible to compute, and thus we would need to do it by simulation.<br />
<br />
'''Note''': <br />
#<math>f(X)=\int^{}_\theta f(X | \theta)f(\theta) d\theta</math> is not a function of <math>\theta</math>, and is called the '''Normalization Factor'''<br />
#Therefore, since <math>f(X)</math> is constant with respect to <math>\theta</math>, the posterior is proportional to the likelihood times the prior: <math>f(\theta|X)\propto f(X | \theta)f(\theta)</math><br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior is Gaussian:<br />
Here the unknown parameter is <math>\theta = \mu</math>, with <math>\sigma</math> assumed known.<br />
:<math>P(\mu) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\mu|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square we conclude that the posterior is Gaussian as well:<br />
:<math>f(\mu|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math> and <math>\tilde{\sigma}^2 = \frac{1}{\frac{N}{\sigma^2}+\frac{1}{\tau^2}}</math><br />
The posterior mean <math>\tilde{\mu}</math> is a weighted average of the sample mean and the prior mean, so it generally differs from the MLE.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
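<br />
The behaviour of the posterior mean can be checked numerically. The following Python sketch (Python rather than Matlab) is a minimal illustration; the function name <code>gaussian_posterior</code> is an assumption of this sketch, and it uses the standard conjugate result that the posterior variance is <math>\left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math> when <math>\sigma</math> is known.<br />
<br />
```python
def gaussian_posterior(xs, sigma, mu0, tau):
    """Posterior mean and variance of mu for data xs ~ N(mu, sigma^2)
    with sigma known and prior mu ~ N(mu0, tau^2)."""
    n = len(xs)
    data_precision = n / sigma**2   # weight given to the sample mean
    prior_precision = 1.0 / tau**2  # weight given to the prior mean
    total = data_precision + prior_precision
    xbar = sum(xs) / n if n else 0.0
    post_mean = (data_precision * xbar + prior_precision * mu0) / total
    post_var = 1.0 / total
    return post_mean, post_var


# With no data the posterior equals the prior.
print(gaussian_posterior([], sigma=1.0, mu0=5.0, tau=2.0))
# With many observations the posterior mean approaches the sample mean.
print(gaussian_posterior([2.0] * 10_000, sigma=1.0, mu0=5.0, tau=2.0))
```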
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process whose future evolution is described by probability distributions (the counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim Unif(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
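<br />
The process, together with the standard error and confidence interval, can be sketched in Python (used here instead of Matlab). The function name <code>mc_integrate</code> is an assumption of this sketch, and the value 1.96 is the usual normal quantile for a 95% interval.<br />
<br />
```python
import math
import random


def mc_integrate(h, a, b, n, rng):
    """Estimate the integral of h over [a, b] with n uniform draws.
    Returns the estimate and its standard error."""
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]  # Y_i = w(x_i)
    estimate = sum(ys) / n
    s2 = sum((y - estimate) ** 2 for y in ys) / (n - 1)      # sample variance
    return estimate, math.sqrt(s2 / n)


rng = random.Random(0)
est, se = mc_integrate(lambda x: x**3, 0.0, 1.0, 50_000, rng)
ci = (est - 1.96 * se, est + 1.96 * se)
print(est, se, ci)  # the exact value of the integral is 1/4
```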
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x > 0</math><br />
<br />
# We need to draw <math>x_1, \dots, x_N</math> from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, for example by the Inverse Transform Method<br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
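u=rand(100,1); %draws from Unif(0,1)<br />
x=-log(u); %inverse transform, so x has density exp(-x)<br />
h=x.^0.5; %h(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value obtained by numerical integration in Matlab (the exact value is sqrt(pi)/2, about 0.8862)<br />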
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math> (the exact value, for comparison)<br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1); %1000 draws from Unif(0,1)<br />
mean(x.^3) %elementwise cube, then average<br />
<br />
'''Example 3'''<br />
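To estimate an integral over an infinite range<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>dx = -\frac{dy}{y^2}</math>, and the limits <math>x=5</math> and <math>x\rightarrow\infty</math> become <math>y=1</math> and <math>y\rightarrow 0</math><br />
# <math> I = \displaystyle\int^0_1 3e^{-(\frac{1}{y}+4)}\left(-\frac{1}{y^2}\right)\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />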
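<br />
A possible Matlab sketch of Example 3 is given below. It uses the substitution <math>y = \frac{1}{x-4}</math>, for which <math>dx = -\frac{dy}{y^2}</math> and the range <math>(5,\infty)</math> maps onto <math>(0,1)</math>, so the integrand on <math>(0,1)</math> is <math>3e^{-(\frac{1}{y}+4)}/y^2</math>. The sample size is an arbitrary choice, and the closed form <math>3e^{-5}</math> is included only for comparison.<br />

```matlab
% Monte Carlo estimate of I = int_5^inf 3*exp(-x) dx
% after the substitution y = 1/(x-4), which maps (5,inf) onto (0,1).
N = 100000;                        % number of draws (arbitrary choice)
y = rand(N,1);                     % y_i ~ Unif(0,1)
w = 3*exp(-(1./y + 4))./y.^2;      % transformed integrand evaluated at y_i
I_hat = mean(w);                   % Monte Carlo estimate
I_exact = 3*exp(-5);               % closed form of the integral, for comparison
```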
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began discussing Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
By Bayes' rule, <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math>, where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be flat (uniform) and <math>\displaystyle f(x,y)</math> is a normalizing constant. The posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find the conditional (posterior) distributions of <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior density of <math>\displaystyle p</math> is proportional to the Binomial likelihood <math>\displaystyle p^x(1-p)^{n-x}</math> viewed as a function of <math>\displaystyle p</math>, which is the kernel of a Beta density. Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> for each draw in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta)<br />
<br />
In one run of this code, the estimated mean was 0.1938; the value varies slightly from run to run because the samples are random.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The 95% interval is approximately <math>(0.06606, 0.32204)</math>.<br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>,<br><br />
where <math>\displaystyle p_i</math> is the probability of death at dose <math>\displaystyle x_i</math>. We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. We want to estimate the smallest dose at which the animals have at least a 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
Assuming a flat prior on <math>\displaystyle (p_1,...,p_{10})</math> restricted to the region <math>\displaystyle A</math>:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1,...,10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and draw again.<br />
# Compute <math>\displaystyle \delta = x_j</math>, where <math>\displaystyle j</math> is the smallest index with <math>\displaystyle p_j \geq 0.5</math>.<br />
# Repeat the process N times to get N values <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, at each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die, <math>Y_i</math>, is observed, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
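<br />
The Monte Carlo process for the dose-response example might be sketched in Matlab as below. The dose levels and death counts in this sketch are hypothetical and are not the data from the lecture; <math>Y_i</math> is taken to be the number of deaths at dose <math>x_i</math>, and draws in which no dose reaches a 50% death probability are simply skipped, which is a simplifying assumption. The function betarnd requires the Statistics Toolbox.<br />

```matlab
% Rejection-style Monte Carlo for the dose-response example.
% The doses and death counts are hypothetical, not data from the lecture.
x = 1:10;                          % hypothetical dose levels x_1 < ... < x_10
y = [1 1 2 3 4 5 6 8 9 10];        % hypothetical number of deaths per dose
n = 10;                            % rats tested at each dose
N = 200;                           % number of accepted samples wanted
delta = zeros(N,1);
k = 0;
while k < N
    p = betarnd(y+1, n-y+1);       % draw p_i ~ Beta(y_i+1, n-y_i+1), i = 1..10
    % keep the draw only if it is ordered and some dose reaches 50% deaths
    if all(diff(p) >= 0) && any(p >= 0.5)
        k = k + 1;
        delta(k) = x(find(p >= 0.5, 1));   % smallest dose with p_i >= 0.5
    end
end
delta_hat = mean(delta);           % Monte Carlo estimate of delta
```

Most draws are rejected by the ordering constraint, so the loop usually needs many more than N iterations; this is the price of sampling from the constrained posterior by rejection.<br />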
<br />
====Importance Sampling====<br />
<br />
In statistics, importance sampling is a technique for estimating properties of a particular distribution using samples drawn from a different distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the random variables. The main idea of importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others, so these values are weighted accordingly [Importance Sampling, Wikipedia]. The following diagram illustrates a Monte Carlo approximation of the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the distribution we sample from and g(x) is the function to be integrated. Here we write the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> with <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the Law of Large Numbers, <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to those values sampled from <math>\displaystyle g(x)</math> at which the original density <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), is large relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, but weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that the points that are more likely under <math>\displaystyle f</math> than under <math>\displaystyle g</math> count more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we average the weighted values <math> B(x_i)h(x_i) </math> instead of the plain values <math> h(x_i) </math>.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math> We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> faster than <math>\displaystyle h^2(x)f^2(x) </math>, then <math>\displaystyle E[(w(x))^2] </math> can be infinite. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, since then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can be arbitrarily large in the tails. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. The absolute value of <math>\displaystyle h(x)</math> is needed because <math>\displaystyle g</math> is a density and so cannot be negative. The theorem below shows which <math>\displaystyle g</math> minimizes the variance.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
======Aside: Jensen's Inequality======<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math> for all <math>\displaystyle 0 \leq \alpha \leq 1</math>. In this aside <math>\displaystyle g</math> denotes a generic convex function, not the sampling density.<br /><br />
Essentially the definition of convexity says that the line segment between two points on a curve lies above the curve. This can be generalized to more than two points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
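Using Jensen's Inequality, for a convex function <math>\displaystyle g</math>, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Applying this with the convex function <math>\displaystyle g(t) = t^2</math> to the random variable <math>\displaystyle \left| w(x) \right|</math> gives<br />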
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
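Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>, and it does not depend on <math>\displaystyle g</math>. Substituting <math>\displaystyle g^*(x) </math> into <math>\displaystyle E_g[(w(x))^2]</math> shows that the bound is attained:<br /><br />
::<math>\displaystyle \int \frac{h^2(x)f^2(x)}{g^*(x)} dx = \int \left| h(s) \right| f(s) ds \int \left| h(x) \right| f(x) dx = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math><br />
<br />
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, because its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the kind of integral we are trying to estimate.<br />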
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since <math>\displaystyle \int f(x)\,dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant, because any unknown constant in f cancels in the ratio. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
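<br />
The normalized-weight form of Importance Sampling can be sketched in Matlab as follows. The target, proposal and <math>h</math> here are assumptions chosen for illustration only: f is taken proportional to <math>e^{-x^2/2}</math> with its constant deliberately left out, g is the <math>N(0,2^2)</math> density, and <math>h(x)=x^2</math>, so the true value is <math>E[X^2]=1</math> for <math>X \sim~ N(0,1)</math>.<br />

```matlab
% Importance sampling with normalized weights, for a target f that is
% known only up to a constant. All choices below are illustrative.
f_unnorm = @(x) exp(-x.^2/2);              % target density without its constant
g = @(x) exp(-x.^2/8)/(2*sqrt(2*pi));      % proposal density: N(0, 2^2)
h = @(x) x.^2;                             % function whose mean we want
N = 100000;
x = 2*randn(N,1);                          % x_i ~ g
b = f_unnorm(x)./g(x);                     % unnormalized weights b(x_i)
b_norm = b/sum(b);                         % b'(x_i), which sum to 1
I_hat = sum(b_norm.*h(x));                 % estimate of E_f[h(X)]
```

Multiplying f_unnorm by any positive constant leaves b_norm, and therefore I_hat, unchanged, which is why the missing constant does not matter.<br />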
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_i \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to count the proportion of observations that are greater than 3.<br />
<br />
The variability will be high if plain Monte Carlo is used, since this is a low probability event (a tail event), and different runs may give significantly different values. For example, with 1000 samples the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (each of the 100 runs uses 100 samples):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x)</math> be the <math>N(4,1)</math> density<br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math>, which is very close to 0 and much less than the variability observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of a Markov chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math>, where:<br />
*'''State Space:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set from which t takes its values; it can be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d. random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle P_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Suppose that the direction of our next step is decided by a coin flip: the probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. Since row <math>i</math> holds the probabilities of leaving state <math>i</math>, the transition matrix that expresses this process is:<br />
:<math>P=\left(\begin{matrix}1&0&0&\dots&0&0&0\\<br />
q&0&p&\dots&0&0&0\\<br />
0&q&0&\dots&0&0&0\\<br />
\vdots&\vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\<br />
0&0&0&\dots&q&0&p\\<br />
0&0&0&\dots&0&0&1<br />
\end{matrix}\right)</math><br />
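<br />
A small Matlab sketch of such a random walk is given below. It assumes that row <math>i</math> of <math>P</math> holds the probabilities of leaving state <math>i</math>, with a step to the right having probability <math>p</math>; the number of states, the value of <math>p</math> and the starting state are arbitrary choices for illustration.<br />

```matlab
% Random walk on states 1..N with absorbing end states.
% N, p and the starting state are illustrative choices.
N = 5; p = 0.6; q = 1 - p;
P = zeros(N);
P(1,1) = 1; P(N,N) = 1;          % once state 1 or N is reached, we stop
for i = 2:N-1
    P(i,i+1) = p;                % step right with probability p
    P(i,i-1) = q;                % step left with probability q
end
state = 3;                       % start in the middle
path = state;                    % record of visited states
while state ~= 1 && state ~= N
    % sample the next state from row 'state' of P by inverting its cdf
    state = find(rand < cumsum(P(state,:)), 1);
    path(end+1) = state;
end
```

The vector path then holds one realization of the chain, ending at whichever boundary state was reached first.<br />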
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a variation of the Monte Carlo estimate can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1 over all <math>j</math>, since row <math>i</math> must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state space of <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}Pr(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}Pr(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{Pr(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
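From this it follows that the n-step transition matrix is <math>\frac{}{}P(n)=P^n</math> (take <math>\frac{}{}\mu_0</math> to be a point mass at state <math>i</math>; then <math>\frac{}{}\mu_n(j) = P_{ij}(n)</math> is the <math>(i,j)</math> entry of <math>\frac{}{}P^n</math>).<br />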
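<br />
The relation <math>\mu_n=\mu_0P^n</math> can be checked numerically with a few lines of Matlab. The two-state transition matrix and the initial distribution below are assumptions made up for this check.<br />

```matlab
% Numerical check of mu_n = mu_0 * P^n on a small homogeneous chain.
% The matrix and the initial distribution are illustrative assumptions.
P = [0.9 0.1;
     0.5 0.5];                   % each row is a pmf, so rows sum to 1
mu0 = [1 0];                     % start in state 1 with probability 1
n = 10;
mu = mu0;
for k = 1:n
    mu = mu*P;                   % one-step update: mu_k = mu_{k-1} * P
end
mu_direct = mu0*P^n;             % n-step formula
```

Both mu and mu_direct are row vectors, which matches the convention that the marginal <math>\mu_n</math> multiplies <math>P</math> from the left.<br />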
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an NxN matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also NxN, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that <math>j</math> is accessible from <math>i</math>: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A Markov chain is '''Irreducible''' if all states communicate with each other (its state space has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that the state space of a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible, and it reduces to a single class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that the states 1 and 2 have only <math>\displaystyle P_{12}</math> and <math>\displaystyle P_{21}</math> as nonzero transition probability<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> The state 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class with only one element<br />
::<br />
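<br />
As a rough illustration (a sketch in Python rather than the Matlab used elsewhere in these notes; the helper names <code>communicating_classes</code> and <code>is_closed</code> are ours and purely illustrative), the decomposition above can be recovered mechanically from the transition matrix by checking which states can reach each other:<br />

```python
from itertools import product

def communicating_classes(P):
    """Partition states 0..n-1 into classes of states that reach each other."""
    n = len(P)
    # reach[i][j] is True when j is reachable from i in zero or more steps
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k, i, j in product(range(n), repeat=3):  # Warshall closure, k is the outer loop
        if reach[i][k] and reach[k][j]:
            reach[i][j] = True
    classes = []
    for i in range(n):
        cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
        if cls not in classes:
            classes.append(cls)
    return classes

def is_closed(P, cls):
    """A class is closed when no state in it can jump outside of it."""
    return all(P[i][j] == 0 for i in cls for j in range(len(P)) if j not in cls)

# Transition matrix of the example (states 1..4 are stored as indices 0..3)
P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]

for cls in communicating_classes(P):
    labels = [s + 1 for s in cls]
    print(labels, "closed" if is_closed(P, cls) else "open")
```

The sketch reports the classes {1, 2} and {4} as closed and {3} as open, in agreement with the decomposition given above.<br />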
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If this is not the case (i.e. the probability is less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous property and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, states one and two form a closed class, so they are both recurrent. State four is an absorbing state, so it is also recurrent. State three is transient, since the chain can leave it and never return.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. If we move n steps to the right (or to the left), the only way to go back to <math>\displaystyle 0</math> is to take n steps in the opposite direction, so a return is only possible after an even number of steps 2n.<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle P_{00}(n)=0</math> for odd n, the sum <math>\displaystyle \sum_{\forall n}P_{00}(n)</math> converges or diverges together with:<br />
<br />
<math>\displaystyle \sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Because <math>\displaystyle 4pq \leq 1</math>, with equality only when <math>\displaystyle p=q=\frac{1}{2}</math>, we can conclude that if <math>\displaystyle 4pq < 1</math> then the state is transient (the terms decay geometrically); otherwise it is recurrent (the terms behave like <math>\displaystyle 1/\sqrt{n\pi}</math>, whose sum diverges).<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n) \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
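<br />
A quick numerical check of both arguments can be sketched as follows (Python rather than Matlab; the function names are illustrative only). The exact return probabilities are built from the ratio of consecutive binomial terms, compared with the Stirling approximation, and summed:<br />

```python
from math import pi, sqrt

def return_probs(N, p):
    """Exact P_00(2n) for n = 0..N, built from the ratio of consecutive terms."""
    q = 1 - p
    probs = [1.0]  # P_00(0) = 1
    for n in range(1, N + 1):
        # C(2n, n) / C(2n-2, n-1) = 2(2n-1)/n, so each term is the previous one times this
        probs.append(probs[-1] * 2 * (2 * n - 1) / n * p * q)
    return probs

def stirling_approx(n, p):
    """The approximation (4pq)^n / sqrt(n*pi) for P_00(2n)."""
    return (4 * p * (1 - p)) ** n / sqrt(n * pi)

for p in (0.5, 0.3):
    probs = return_probs(10000, p)
    ratio = probs[100] / stirling_approx(100, p)
    print(p, ratio, sum(probs[:101]), sum(probs))
```

For <math>p=0.3</math> the partial sums settle at <math>1/\sqrt{1-4pq}=2.5</math>, while for <math>p=0.5</math> they keep growing as more terms are added, which is the transient versus recurrent behaviour derived above.<br />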
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i, in which case the time is infinite.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n\geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math> (given <math>\displaystyle x_{0}=i</math>)<br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle n > 1</math> if, starting from a state, a return to that state is possible only at multiples of <math>\displaystyle n</math> steps.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition. Starting from state 1, the chain can be back at state 1 only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
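<br />
The two examples can be checked with a small sketch (Python rather than Matlab). It assumes the usual formal definition of the period of a state, namely the greatest common divisor of all step counts at which a return has positive probability; the names <code>mat_mul</code>, <code>period</code> and the <code>lazy</code> chain are ours and only serve as illustration:<br />

```python
from math import gcd

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def period(P, i, max_steps=30):
    """gcd of the step counts n <= max_steps at which a return to i is possible."""
    d, Pn = 0, P  # Pn holds the n-step transition matrix P^n
    for n in range(1, max_steps + 1):
        if Pn[i][i] > 0:
            d = gcd(d, n)
        Pn = mat_mul(Pn, P)
    return d

cycle3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
cycle4 = [[0, .5, 0, .5], [.5, 0, .5, 0], [0, .5, 0, .5], [.5, 0, .5, 0]]
lazy = [[.5, .5], [.5, .5]]  # hypothetical chain that can stay put, for contrast

print(period(cycle3, 0), period(cycle4, 0), period(lazy, 0))  # prints: 3 2 1
```

A result of 1 means the state is aperiodic; the cut-off <code>max_steps</code> is an arbitrary choice that is large enough for these small chains.<br />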
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math><br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
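<br />
A minimal sketch of this warning (in Python rather than Matlab; the helper <code>step</code> is ours): the uniform vector is left unchanged by one step of the chain, yet a chain started in state 1 keeps cycling through the three states instead of settling down.<br />

```python
def step(dist, P):
    """One step of the chain on distributions: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

pi = [1/3, 1/3, 1/3]
print(step(pi, P) == pi)  # the uniform vector is stationary

dist = [1, 0, 0]  # start in state 1 with probability 1
history = []
for _ in range(6):
    dist = step(dist, P)
    history.append(dist)
print(history)  # the point mass moves 2, 3, 1, 2, 3, 1 and never converges
```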
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi</math> = [0.5 0 0.5], then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
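<br />
The result can be double-checked with a short sketch (Python rather than Matlab, using exact fractions from the standard library; the helper <code>step</code> is ours). It verifies <math>\pi=\pi P</math> exactly and then iterates the chain from state 0 to see the limiting behaviour:<br />

```python
from fractions import Fraction as F

P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(1, 4), F(1, 4)],
     [F(0), F(1, 3), F(2, 3)]]

def step(dist, P):
    """One step of the chain on distributions: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

pi = [F(4, 11), F(4, 11), F(3, 11)]
print(step(pi, P) == pi)  # exact check of pi = pi P

dist = [F(1), F(0), F(0)]  # start in state 0 with probability 1
for _ in range(60):
    dist = step(dist, P)
print([round(float(x), 6) for x in dist])
```

After 60 steps the distribution is numerically indistinguishable from <math>(4/11, 4/11, 3/11) \approx (0.3636, 0.3636, 0.2727)</math>, whatever the starting state.<br />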
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(X_i)\longrightarrow E_f(h(X))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from <math>f(x)</math>, the 'target' distribution. Choose <math>q(y | x)</math>, the 'proposal' distribution that is easily sampled from.<br /><br /><br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>; this is the starting point of the chain. Choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y \sim q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. Draw <math> U \sim Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A common choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>\displaystyle x</math>. The starting point <math>X_0</math> of the chain is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of Metropolis-Hastings.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The density of sampling y from a normal with mean x is equal to the density of sampling x from a normal with mean y.<br />
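<br />
The procedure and remarks above can be sketched generically as follows. This is an illustration in Python rather than Matlab; the function name <code>metropolis_hastings</code>, its arguments, and the standard normal target used to exercise it are our own assumptions, not part of the lecture.<br />

```python
import math
import random

def metropolis_hastings(f, propose, q, x0, n_steps, rng):
    """Run the procedure above.

    f(x)           : target density, possibly unnormalized
    propose(x, rng): draws Y from q(. | x)
    q(y, x)        : proposal density q(y | x), up to a constant factor
    """
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        y = propose(x, rng)                             # step 2
        r = min(f(y) / f(x) * q(x, y) / q(y, x), 1.0)   # step 3
        u = rng.random()                                # step 4
        xs.append(y if u < r else x)                    # step 5
    return xs

b = 2.0
target = lambda x: math.exp(-0.5 * x * x)    # assumed example target: N(0,1) up to a constant
propose = lambda x, rng: rng.gauss(x, b)     # N(x, b^2) proposal, as in Remark 1
q = lambda y, x: math.exp(-0.5 * ((y - x) / b) ** 2)

rng = random.Random(0)
xs = metropolis_hastings(target, propose, q, 0.0, 20000, rng)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # should be close to 0 and 1 for this target
```

Because this proposal is symmetric, the factor <code>q(x, y) / q(y, x)</code> equals 1 and the sketch reduces to the Metropolis algorithm of Remark 2; the factor is kept so that an asymmetric proposal could be plugged in.<br />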
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
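 b=2; % standard deviation of the proposal; we will see what happens when b is smaller or larger<br />
 N=10000; % number of samples<br />
 X=zeros(1,N); % preallocate the chain<br />
 X(1)=randn; % random starting point<br />
 for i=2:N<br />
     Y=b*randn+X(i-1); % propose Y ~ N(X(i-1),b^2); we want to decide whether we accept this Y<br />
     r=min( (1+X(i-1)^2)/(1+Y^2),1); % acceptance probability for the Cauchy target<br />
     u=rand;<br />
     if u<r<br />
         X(i)=Y; % accept Y<br />
     else<br />
         X(i)=X(i-1); % reject Y, remaining in the current state<br />
     end;<br />
 end;<br />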
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed point usually lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is typically very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>; the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution.<br />
<br />
::It is easy to observe by looking at the histogram of <math>\displaystyle X</math>, the shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
:: Most likely we reject y and the chain will get stuck.<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
:'''If b is just right'''<br />
::Well chosen b will help avoid the issues mentioned above and we can say that the chain is "mixing well".<br />
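<br />
The effect of b can be seen in a small experiment, sketched here in Python rather than Matlab. The function name <code>cauchy_chain_summary</code>, the chain length and the seed are arbitrary choices of ours. It runs the Cauchy example for the three values of b shown in the figures below and reports the fraction of accepted proposals together with the width of the region the chain visited:<br />

```python
import random

def cauchy_chain_summary(b, n_steps=5000, seed=0):
    """Cauchy example with proposal N(x, b^2): returns (acceptance rate, range visited)."""
    rng = random.Random(seed)
    x = rng.gauss(0, 1)
    lo = hi = x
    accepted = 0
    for _ in range(n_steps):
        y = x + b * rng.gauss(0, 1)
        r = min((1 + x * x) / (1 + y * y), 1.0)
        if rng.random() < r:
            x = y
            accepted += 1
            lo, hi = min(lo, x), max(hi, x)
    return accepted / n_steps, hi - lo

for b in (0.001, 2, 1000):
    print(b, cauchy_chain_summary(b))
```

With b = 0.001 almost every proposal is accepted but the chain barely moves; with b = 1000 almost every proposal is rejected and the chain stays stuck; b = 2 sits in between.<br />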
<br />
<br />
<gallery widths="250px" heights="250px"><br />
Image:Blarge.jpg|B too large (B=1000)<br />
Image:Bsmall.JPG|B too small (B=0.001)<br />
Image:Bgood.JPG|Good choice of B (B=2)<br />
</gallery><br />
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We talked about <math>\emph{discrete}</math> MC so far. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
That is to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math><br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
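<br />
The same argument can be checked numerically on a small discrete state space. The sketch below is in Python rather than Matlab; the target weights <code>f</code>, the asymmetric proposal matrix <code>Q</code> and the helper <code>mh_kernel</code> are made-up illustrations. It builds the Metropolis-Hastings transition matrix and tests detailed balance and stationarity with exact fractions:<br />

```python
from fractions import Fraction as F

def mh_kernel(f, Q):
    """Transition matrix of Metropolis-Hastings on states 0..n-1.

    f: target weights (need not sum to 1), Q[x][y]: probability of proposing y from x.
    """
    n = len(f)
    P = [[F(0)] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y and Q[x][y] > 0:
                r = min(f[y] * Q[y][x] / (f[x] * Q[x][y]), F(1))
                P[x][y] = Q[x][y] * r      # propose y, then accept it
        P[x][x] = 1 - sum(P[x])            # rejected proposals keep the chain at x
    return P

f = [F(1), F(2), F(3), F(4)]               # hypothetical unnormalized target
Q = [[F(0), F(1, 2), F(1, 4), F(1, 4)],    # hypothetical asymmetric proposal
     [F(1, 3), F(0), F(1, 3), F(1, 3)],
     [F(1, 4), F(1, 4), F(0), F(1, 2)],
     [F(1, 2), F(1, 4), F(1, 4), F(0)]]

P = mh_kernel(f, Q)
n = len(f)
balanced = all(f[x] * P[x][y] == f[y] * P[y][x] for x in range(n) for y in range(n))
pi = [v / sum(f) for v in f]
stationary = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)] == pi
print(balanced, stationary)  # prints: True True
```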
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although less applicable at times.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math> including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>, i.e. <math>\epsilon = Y-X_i \sim g </math>; <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that when <math>\displaystyle g </math> is symmetric about zero, as in the normal case, it is a function of the distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. Accept <math>\displaystyle Y</math> with probability <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. (It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known but the distribution is not exactly known.)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r(x_i, y) \\<br />
x_i, & \text{with probability } 1 - r(x_i, y) \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because the acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
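<br />
A minimal sketch of the independence sampler, in Python rather than Matlab. The function name <code>independence_mh</code>, the standard normal target and the fixed <math>N(0, 2^2)</math> proposal are assumptions made for illustration. Working with logarithms of the densities avoids dividing very small numbers. The repeated values in the output show directly that consecutive states are dependent:<br />

```python
import math
import random

def independence_mh(log_f, sample_q, log_q, x0, n_steps, rng):
    """Metropolis-Hastings with a proposal q(y) that ignores the current state."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        y = sample_q(rng)
        log_ratio = log_f(y) - log_f(x) + log_q(x) - log_q(y)
        r = math.exp(min(log_ratio, 0.0))    # r = min{ f(y)q(x) / (f(x)q(y)), 1 }
        xs.append(y if rng.random() < r else x)
    return xs

log_f = lambda x: -0.5 * x * x               # assumed target: N(0,1) up to a constant
log_q = lambda x: -0.5 * (x / 2.0) ** 2      # fixed proposal N(0, 2^2) up to a constant
sample_q = lambda rng: rng.gauss(0.0, 2.0)

rng = random.Random(1)
xs = independence_mh(log_f, sample_q, log_q, 0.0, 20000, rng)
repeats = sum(1 for a, b in zip(xs, xs[1:]) if a == b)
print(repeats / (len(xs) - 1))  # fraction of steps where the proposal was rejected
```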
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
Simulated Annealing is a method of optimization and an application of the Metropolis Hastings Algorithm.<br />
<br />
Consider the problem where we want to find <math>x</math> such that the objective function <math>h(x)</math> is at its minimum,<br /><br /><br />
<br />
<math>\ \min_{x}(h(x)) </math><br /><br /><br />
<br />
Given a constant T and since the exponential function is monotone, this optimization problem is equivalent to,<br /><br /><br />
<br />
<math>\ \max_{x}(e^{\frac{-h(x)}{T}})</math> <br /><br /><br />
<br />
We consider a distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br /><br />i.e. <math>X_{i+1} = Y</math> if <math>U<r</math><br /><br /><br />
:#Decrease T and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math> (but close to 1 when T is large, so uphill moves are still likely to be accepted)<br />
<br />
<b>On the other hand, suppose T is arbitrarily small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = 0 </math> and we therefore reject <math>\displaystyle y </math><br />
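<br />
To make the two regimes concrete, here is a small sketch (Python rather than Matlab; the function name <code>accept_prob</code> is ours) that evaluates the acceptance probability of an uphill move that raises h by 1 and of the corresponding downhill move, for a large, a moderate and a small temperature:<br />

```python
import math

def accept_prob(h_x, h_y, T):
    """r = min{exp((h(x) - h(y)) / T), 1} for a symmetric proposal."""
    if h_y <= h_x:          # downhill (or equal): always accept
        return 1.0
    return math.exp((h_x - h_y) / T)

for T in (100, 1, 0.01):
    uphill = accept_prob(0.0, 1.0, T)     # move that raises h by 1
    downhill = accept_prob(1.0, 0.0, T)   # move that lowers h by 1
    print(T, uphill, downhill)
```

The downhill move is always accepted, while the uphill move is accepted with probability about 0.99 at T = 100, about 0.37 at T = 1, and essentially 0 at T = 0.01. Handling the downhill case separately also avoids evaluating a huge exponential when T is tiny.<br />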
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{-\frac{f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
<gallery widths="250px" heights="250px"><br />
Image:ezplotf1.jpg|T = 100<br />
Image:ezplotf2.jpg|T = 0.1<br />
</gallery><br />
<br />
<br />
In the end, T is small and the region we are trying to sample from becomes sharper. The points that we accept are increasingly close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
In practice the algorithm may get 'stuck' in another local minimum nearby if T becomes too small too quickly, and we don't get the convergence we are looking for.<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
 % objective function (assumed here to be saved in its own file, obj.m)<br />
 function v = obj(x)<br />
     v = (x - 2).^2;<br />
 <br />
 % main script<br />
 T = 10; %this is the initial value of T, which we must gradually decrease<br />
 b = 2; %standard deviation of the proposal<br />
 X(1) = 0; %starting point<br />
 for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
     Y = b*randn + X(i-1); %propose Y ~ N(X(i-1), b^2)<br />
     r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) ); %acceptance probability<br />
     U = rand;<br />
     if U < r<br />
         X(i) = Y; %accept Y<br />
     else<br />
         X(i) = X(i-1); %reject Y<br />
     end;<br />
 end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
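<br />
The whole schedule can also be run in one go. The following is a sketch in Python rather than Matlab; the function name <code>simulated_annealing</code>, the seed and the choice of 100 steps per temperature are our assumptions. The temperatures are those listed above, starting from <math>T = 10</math>:<br />

```python
import math
import random

def h(x):
    return (x - 2) ** 2

def simulated_annealing(h, x0, temperatures, steps_per_T, b, rng):
    """Run the annealing procedure with a N(x, b^2) proposal; returns every state visited."""
    x, path = x0, [x0]
    for T in temperatures:
        for _ in range(steps_per_T):
            y = x + b * rng.gauss(0, 1)
            # downhill moves are always accepted, uphill ones with probability exp((h(x)-h(y))/T)
            if h(y) <= h(x) or rng.random() < math.exp((h(x) - h(y)) / T):
                x = y
            path.append(x)
    return path

temperatures = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
path = simulated_annealing(h, 0.0, temperatures, 100, 2.0, random.Random(0))
print(path[-1])  # should end close to the minimizer x = 2
```

Plotting <code>path</code> against the step index should give a picture similar to the graph above: wide excursions at high temperature that tighten around 2 as T decreases.<br />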
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, as Metropolis-Hastings does:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the one given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method mixes slowly, and can become impractical, when the random variables are strongly correlated.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle <br />
<br />
x_1^* \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2^* \sim~ f(x_2|x_1^*, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n^* \sim~ f(x_n|x_1^*, x_2^*, ... , x_{n-1}^*)<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1^*, x_2^*, \ldots, x_n^*) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the current value of X, sample a new value of Y)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution is sampled better by discarding the early draws and using only the later <math>\displaystyle X_i </math>, rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2];                 % mean vector [mu_1 ; mu_2]<br />
rho = 0.01;                  % correlation between the two components<br />
sigma = sqrt(1 - rho^2);     % conditional standard deviation<br />
X = zeros(5000,2);           % preallocate; X(1,:) = [0 0] is the initial point<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2));   % conditional mean of x_1 given the previous x_2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1));     % conditional mean of x_2 given the NEW x_1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -> <br />
%this demonstrates that convergence occurs after a while <br />
%(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some random variables <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
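<br />
The procedure above can be sketched in MATLAB as follows. This is only an illustration under stated assumptions: the target <code>f</code> is taken to be an unnormalized bivariate normal density with correlation <code>rho</code>, and both proposals <math>q</math> and <math>\tilde{q}</math> are symmetric normal random walks with standard deviation <code>b</code>, so the proposal ratio in <math>r</math> equals 1 and drops out. None of these choices come from the lecture.<br />

```matlab
rho = 0.5;                                   % assumed correlation of the target
f = @(x,y) exp(-(x.^2 - 2*rho*x.*y + y.^2) / (2*(1 - rho^2)));   % unnormalized target density
b = 1;                                       % assumed proposal standard deviation
n = 5000;
X = zeros(n,1);  Y = zeros(n,1);             % X(1) = Y(1) = 0 is the starting point

for k = 1:n-1
    % Metropolis-Hastings step for X, treating Y(k) as fixed
    Z = X(k) + b*randn;                      % symmetric proposal, so the q ratio is 1
    r = min(1, f(Z,Y(k)) / f(X(k),Y(k)));
    if rand < r
        X(k+1) = Z;                          % accept
    else
        X(k+1) = X(k);                       % reject
    end
    % Metropolis-Hastings step for Y, treating the NEW X(k+1) as fixed
    Z = Y(k) + b*randn;
    r = min(1, f(X(k+1),Z) / f(X(k+1),Y(k)));
    if rand < r
        Y(k+1) = Z;
    else
        Y(k+1) = Y(k);
    end
end
```

As with ordinary Gibbs sampling, the early draws would normally be discarded before the sample is used.<br />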
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
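Since rank is a relative term, we may fix the overall scale of P. Summing the matrix equation over all pages (assuming every page has at least one outgoing link, so that each column of <math>\displaystyle L \times D^{-1}</math> sums to 1, and that <math>\displaystyle d < 1</math>) shows that the ranks average to 1: <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
In matrix form this is <math>\displaystyle e^T \times P = N</math>, i.e. <math>\displaystyle \frac{1}{N} \times e^T \times P = 1</math>, and we can use it to solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times \frac{1}{N} \times e \times e^T \times P + d \times L \times D^{-1} \times P \text{ by replacing 1 with } \frac{1}{N} \times e^T \times P </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times \frac{1}{N} \times e \times e^T + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />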
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
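<br />
A minimal MATLAB sketch of this iteration is given below. It applies the update directly to the matrix form <math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P</math>. The four-page link structure and the value <code>d = 0.85</code> are assumptions chosen for illustration, and every page is assumed to have at least one outgoing link so that <math>\displaystyle D^{-1}</math> exists.<br />

```matlab
% Hypothetical 4-page web: L(i,j) = 1 if page j links to page i.
L = [0 0 1 1;
     1 0 0 0;
     1 1 0 1;
     0 1 1 0];
N = size(L,1);
d = 0.85;                       % assumed value of the coefficient d
c = sum(L,1);                   % c(j) = number of outgoing links from page j
Dinv = diag(1 ./ c);            % D^{-1}; requires every c(j) > 0
e = ones(N,1);

P = e;                          % initial guess
for k = 1:200
    P = (1-d)*e + d*L*Dinv*P;   % repeated application of the matrix form
end

residual = norm(P - ((1-d)*e + d*L*Dinv*P));   % close to 0 once P has converged
P_relative = P / sum(P);                       % relative ranks, rescaled to sum to 1
```

Because <math>\displaystyle d < 1</math>, each pass shrinks the error by a factor of at least <math>\displaystyle d</math>, so a few hundred passes are ample for a small example like this one.<br />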
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
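<br />
The definitions above can be checked numerically. The following MATLAB fragment is a small illustration; the two vectors are arbitrary values chosen so that the arithmetic is easy to verify by hand.<br />

```matlab
u = [1; 2; 2];
v = [2; 0; 1];
ip    = u' * v;                          % inner product: 1*2 + 2*0 + 2*1 = 4
len_u = sqrt(u' * u);                    % length of u: sqrt(1 + 4 + 4) = 3
d_uv  = norm(u - v);                     % distance: u - v = [-1; 2; 1], so sqrt(6)
theta = acos(ip / (norm(u)*norm(v)));    % angle between u and v, in radians
```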
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that the terms ''orthogonal matrix'' and ''orthonormal matrix'' refer to the same kind of matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
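<br />
These statements can be illustrated in MATLAB by storing each set of vectors as the columns of a matrix and comparing the built-in <code>rank</code> (the rank is defined in the next subsection) with the number of vectors. The three matrices are made-up examples.<br />

```matlab
V1 = [1 0; 0 1; 1 1];     % two vectors in R^3 (as columns): linearly independent
V2 = [1 2 3; 0 1 1];      % three vectors in R^2: p > n, so linearly dependent
V3 = [1 0; 2 0; 3 0];     % contains the zero vector: linearly dependent

indep1 = (rank(V1) == size(V1,2));    % true
indep2 = (rank(V2) == size(V2,2));    % false
indep3 = (rank(V3) == size(V3,2));    % false
```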
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_i</math><br />
<br />
i.e. it is the sum of all the eigenvalues <math>\lambda_1, \dots, \lambda_n</math> of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0,\ \forall x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0,\ \forall x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist (because A is either not square or singular). When the columns of A are linearly independent, it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math> and satisfies <math>A^{\dagger}A = I</math>.<br />
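<br />
As a small numerical check of the pseudo-inverse formula, the MATLAB fragment below uses a made-up <math>3 \times 2</math> matrix whose columns are linearly independent (an assumption the formula relies on) and compares the result with the built-in <code>pinv</code>.<br />

```matlab
A = [1 0; 1 1; 1 2];                   % 3x2 matrix with linearly independent columns
A_dag = (A'*A) \ A';                   % (A^T A)^{-1} A^T, computed without an explicit inverse
err_identity = norm(A_dag*A - eye(2)); % A_dag * A should equal the 2x2 identity
err_builtin  = norm(A_dag - pinv(A));  % should agree with the built-in pseudo-inverse
```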
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive.<br />
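<br />
The properties above can be observed numerically with MATLAB's <code>eig</code>. The matrix below is an arbitrary real, symmetric, positive-definite example chosen for illustration.<br />

```matlab
A = [2 1; 1 3];                        % real, symmetric, positive-definite
[V, Lam] = eig(A);                     % columns of V: eigenvectors; diag(Lam): eigenvalues
lambda = diag(Lam);

res_eig   = norm(A*V - V*Lam);         % A*v = lambda*v for every eigenpair
orth_err  = norm(V'*V - eye(2));       % the eigenvectors are orthonormal
all_real  = isreal(lambda);            % eigenvalues are real
all_pos   = all(lambda > 0);           % positive-definite: eigenvalues are positive
trace_err = abs(trace(A) - sum(lambda));   % the trace equals the sum of the eigenvalues
```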
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> into a vector space <math>Y^M</math>, and it is represented by a matrix.<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
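<br />
As an illustration of these properties, the MATLAB fragment below uses a rotation of the plane by 30 degrees, which is one simple example of an orthonormal transformation; the angle and the test vector are arbitrary choices.<br />

```matlab
t = pi/6;
A = [cos(t) -sin(t); sin(t) cos(t)];   % rotation by 30 degrees
err_AAt = norm(A*A' - eye(2));         % A * A^T = I
err_AtA = norm(A'*A - eye(2));         % A^T * A = I
err_inv = norm(inv(A) - A');           % A^T = A^{-1}

x = [3; 4];
y = A*x;
len_x = norm(x);                       % 5
len_y = norm(y);                       % also 5, up to rounding: the magnitude is preserved
```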
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
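<br />
The decorrelation property can be verified directly on a covariance matrix. In the MATLAB sketch below, <code>Sigma</code> is a made-up <math>2 \times 2</math> covariance matrix; if <math>x</math> has covariance <math>\Sigma</math> and <math>V</math> holds the eigenvectors as columns, then <math>y = V^T x</math> has covariance <math>V^T \Sigma V</math>.<br />

```matlab
Sigma = [2 1.2; 1.2 1];            % assumed covariance matrix of x
[V, Lam] = eig(Sigma);             % columns of V: principal directions
Sigma_y = V' * Sigma * V;          % covariance of the transformed vector y = V' * x
offdiag = abs(Sigma_y(1,2));       % covariance between the two components of y
diag_err = norm(Sigma_y - Lam);    % Sigma_y is diagonal, with the eigenvalues as variances
```

The off-diagonal entry vanishes (up to rounding), so the components of <math>y</math> are uncorrelated.<br />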
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
This is the sample variance of the projection <math>\displaystyle u </math> onto the direction given by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point at which the constraint curve <math>\displaystyle g(x,y) = 0</math> touches a contour of <math>\displaystyle f</math> but does not cross it. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the largest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
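<br />
The result of this example can be checked numerically without Lagrange multipliers. The MATLAB sketch below is a brute-force check (not part of the method itself): it parametrizes the constraint circle as <math>\displaystyle (x,y) = (\cos t, \sin t)</math> and evaluates <math>\displaystyle f</math> on a fine grid of <math>\displaystyle t</math>; the grid size is an arbitrary choice.<br />

```matlab
t = linspace(0, 2*pi, 100001);     % points on the constraint circle x^2 + y^2 = 1
fvals = cos(t) - sin(t);           % f(x,y) = x - y evaluated along the circle
[fmax, k] = max(fvals);            % largest value of f on the grid
x_star = cos(t(k));                % approximately  sqrt(2)/2
y_star = sin(t(k));                % approximately -sqrt(2)/2
```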
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>), then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is non-zero and penalizes the violation of the constraint. Setting <math>\displaystyle \frac{\partial L}{\partial \lambda} = 0</math> recovers the constraint, so at a stationary point of <math>\displaystyle L(\textbf{w},\lambda)</math> we have <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
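<br />
The two facts above (the maximum of <math>\textbf{w}^T S\textbf{w}</math> over unit vectors is the largest eigenvalue, and the eigenvalues sum to the trace of S) can be illustrated with a small sketch. It is written in Python rather than Matlab, uses a closed form that only applies to symmetric 2 x 2 matrices, and the matrix <code>S</code> is an arbitrary illustrative choice, not data from the course.<br />
<br />
```python
import math

def eig_sym_2x2(S):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    if b == 0:  # already diagonal: the coordinate axes are the eigenvectors
        pairs = sorted([(a, (1.0, 0.0)), (d, (0.0, 1.0))], reverse=True)
        return [p[0] for p in pairs], [p[1] for p in pairs]
    centre = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    lams = [centre + radius, centre - radius]
    vecs = []
    for lam in lams:
        # (S - lam*I) v = 0 is solved by v proportional to (b, lam - a)
        norm = math.hypot(b, lam - a)
        vecs.append((b / norm, (lam - a) / norm))
    return lams, vecs

def quad_form(S, w):
    """Compute w^T S w, the variance of the data projected onto w."""
    return sum(w[i] * S[i][j] * w[j] for i in range(2) for j in range(2))

S = [[1.0, 1.5], [1.5, 3.0]]  # illustrative covariance matrix
lams, vecs = eig_sym_2x2(S)

print("eigenvalues:", lams)
print("variance along first PC:", quad_form(S, vecs[0]))
print("trace:", S[0][0] + S[1][1], "sum of eigenvalues:", sum(lams))
```
<br />
The variance along the first eigenvector should equal the largest eigenvalue, and no other unit direction should give a larger value.<br />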
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the de-noised image.<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
The matlab code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs , scores ] = princomp (X') <br />
% performs principal components analysis on the data matrix X<br />
% returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
[[File:Plot1.jpg]]<br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
[[File:Plot2.jpg]]<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
 % the 'ro' option makes them red Os on the plot<br />
 [[File:Plot3.jpg]]<br />
 % If we classify based on the position in this plot (feature), <br />
 % it's easier than looking at each of the 64 data pieces<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
[[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (for details see [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y = U^T X_{D \times n} </math>, where, by the dimensions involved, <math>U</math> is the <math>D \times d</math> matrix of the top d eigenvectors. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
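<br />
The centring, encoding and reconstruction steps in the notes above can be sketched on a toy example. This is a hedged illustration in Python (not the Matlab used in class): the five 2-dimensional points are made up, d = 1, the covariance uses a 1/n normalisation, and the eigenvector comes from a closed form that is only valid for symmetric 2 x 2 matrices with a non-zero off-diagonal entry.<br />
<br />
```python
import math

# Made-up data: n = 5 points in D = 2 dimensions.
points = [(2.0, 1.9), (0.0, 0.2), (-1.0, -1.1), (3.0, 3.2), (1.0, 0.8)]
n = len(points)

# 1. Centre the data so that its mean is 0.
mean = tuple(sum(p[i] for p in points) / n for i in range(2))
centred = [(p[0] - mean[0], p[1] - mean[1]) for p in points]

# 2. Covariance matrix S = [[a, b], [b, d]] (1/n normalisation assumed).
a = sum(x * x for x, _ in centred) / n
b = sum(x * y for x, y in centred) / n
d = sum(y * y for _, y in centred) / n

# 3. Eigenvalues and top eigenvector u (closed form, requires b != 0).
radius = math.hypot((a - d) / 2, b)
lam1 = (a + d) / 2 + radius
lam2 = (a + d) / 2 - radius
norm = math.hypot(b, lam1 - a)
u = (b / norm, (lam1 - a) / norm)

# 4. Encode: y_i = u^T x_i (one number per point, since d = 1).
codes = [u[0] * x + u[1] * y for x, y in centred]

# 5. Reconstruct: x_hat_i = u * y_i, then add the mean back.
recon = [(mean[0] + u[0] * c, mean[1] + u[1] * c) for c in codes]

# Mean squared reconstruction error; it should equal the discarded eigenvalue.
mse = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(points, recon)) / n
print("lambda_1 =", lam1, "lambda_2 =", lam2, "mse =", mse)
```
<br />
The variance of the codes equals the first eigenvalue, and the variance lost in reconstruction equals the second, which is the sense in which the low-variance directions are discarded.<br />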
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variances. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to, depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class are close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2</math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar. Therefore,<br />
<br /><br /><br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two class problem the between class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two; that is, we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a very famous problem which is called "the generalized eigenvalue problem". We can solve this using Lagrange multipliers. Since w is a direction vector, we do not care about the size of w. Therefore we solve a problem similar to that in PCA,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br /><br />
subject to <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br /><br /><br />
We solve the following Lagrange multiplier problem,<br />
<br /><br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br /><br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, our Optimization problem for FDA is:<br />
<br />
<math><br />
\max_{w} \frac{w^T s_B w}{w^T s_w w}<br />
</math><br />
<br />
which we turned into<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />Subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
where <math>s_B</math> is the covariance matrix between classes and <math>s_w</math> is the covariance matrix within classes.<br />
<br />
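Using Lagrange multipliers, we look for the stationary points of: <br />
<math>\displaystyle (w^Ts_Bw) - \lambda \cdot [(w^Ts_ww)-1] </math><br />
<br />
Setting the derivative with respect to w to zero gives <math>\displaystyle s_Bw = \lambda s_ww</math>, i.e. <math>\displaystyle s_w^{-1}s_Bw = \lambda w</math> when <math>s_w</math> is invertible. Hence:<br />
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />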
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified, such that:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
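<br />
The two-class solution can be sketched directly. The following Python snippet (an illustration, not the Matlab used in class) assumes two classes with means (1, 1) and (5, 3) and a shared covariance matrix, and the helper names <code>inv_2x2</code>, <code>mat_vec</code> and <code>fisher_ratio</code> are hypothetical. It computes <math>w = s_w^{-1}(\mu_2 - \mu_1)</math> and compares the Fisher ratio at w with the ratio along the line joining the means.<br />
<br />
```python
def inv_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    """Matrix-vector product for a 2x2 matrix."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def quad_form(m, v):
    """Compute v^T m v."""
    mv = mat_vec(m, v)
    return v[0] * mv[0] + v[1] * mv[1]

# Assumed class parameters (both classes share the same covariance).
mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]

s_w = [[2 * x for x in row] for row in sigma]            # Sigma_1 + Sigma_2
diff = [mu2[0] - mu1[0], mu2[1] - mu1[1]]                # mu_2 - mu_1
s_b = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

# Fisher direction for two classes: w = s_w^{-1} (mu_2 - mu_1).
w = mat_vec(inv_2x2(s_w), diff)

def fisher_ratio(v):
    """Objective (v^T s_B v) / (v^T s_w v)."""
    return quad_form(s_b, v) / quad_form(s_w, v)

print("w =", w)
print("ratio at w:", fisher_ratio(w))
print("ratio along mu_2 - mu_1:", fisher_ratio(diff))
```
<br />
Only the direction of w matters: rescaling w leaves the ratio unchanged.<br />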
<br />
=====Example:=====<br />
<br />
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods. <br />
 %First of all, we generate the two data sets:<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300) <br />
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300) <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
 X=[X1;X2]   % stack the two 300 x 2 samples row-wise, giving 600 x 2<br />
 X=X'        % 2 x 600, one data point per column<br />
 [coefs, scores]=princomp(X');<br />
 coefs(:,1) %first principal component<br />
<br />
ans =<br />
0.76355476446932<br />
0.64574307712603<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is no overlap<br />
 Xp=coefs(:,1)'*X<br />
 figure<br />
 plot(Xp(1:300),1,'ob') % a marker is needed for discrete points to be visible<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is an overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we may wish to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (When Y is a continuous variable, the prediction problem is instead one of regression.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set, i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(x)\neq y) </math><br />
<br />
Given test points, how can we find the empirical error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
<br />
Error rate is also called misclassification rate, and 1 minus the error rate is sometimes called the classification rate.<br />
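<br />
The empirical error rate is a simple count. As a minimal sketch in Python (not the Matlab used in class), with a hypothetical threshold classifier <code>h</code> and made-up test points:<br />
<br />
```python
def empirical_error(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) differs from y_i."""
    misclassified = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return misclassified / len(xs)

def h(x):
    """Hypothetical classifier: predict class 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0

# Made-up test set: feature values and their true labels.
xs = [0.1, 0.4, 0.6, 0.9, 0.45]
ys = [0, 1, 1, 1, 0]

err = empirical_error(h, xs, ys)
print("error rate:", err, "classification rate:", 1 - err)
```
<br />
Here one of the five points (x = 0.4 with label 1) is misclassified.<br />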
<br />
=== Bayes Classification Rule ===<br />
<br />
Considering the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math><br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator can be written as <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classifier function<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & o/w\\<br />
\end{cases}</math><br />
<br />
This function is considered to be the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>. A real-life interpretation of the posterior probability <math>\, Pr(Y=1 \mid X=x)</math> may even deal with patterns and meaning, which provides an extra challenge in finding a mathematical interpretation.<br />
[[Image:Set Differentiation.jpg|right]]<br />
In the image, which set would it be more appropriate for the question mark to belong to?<br />
<br />
<br />
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.<br />
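<br />
When the class conditional densities and priors are known, the Bayes classification rule can be evaluated directly. The sketch below (Python, not the Matlab used in class) assumes, purely for illustration, one-dimensional Gaussian class conditionals with means 0 and 2, unit variance, and equal priors; none of these values come from the lecture.<br />
<br />
```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_class1(x, prior1, mu0, mu1, sigma):
    """r(x) = Pr(Y=1 | X=x) computed by Bayes' rule."""
    numerator = normal_pdf(x, mu1, sigma) * prior1
    denominator = numerator + normal_pdf(x, mu0, sigma) * (1 - prior1)
    return numerator / denominator

def bayes_classifier(x, prior1=0.5, mu0=0.0, mu1=2.0, sigma=1.0):
    """h(x) = 1 if r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior_class1(x, prior1, mu0, mu1, sigma) >= 0.5 else 0

for x in (0.5, 1.0, 1.5):
    print(x, posterior_class1(x, 0.5, 0.0, 2.0, 1.0), bayes_classifier(x))
```
<br />
With equal priors and equal variances the posterior is exactly 1/2 midway between the two means, which is where the decision changes.<br />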
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, the boundary consists of those points where the probabilities of belonging to the two classes are identical. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a decision boundary represented by a linear function, while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
<br />
We would like to apply the Bayes classification rule by approximating the class conditional density <br />
<math>\, f_k(x) </math> and the prior <math>\, \pi_k </math> in<br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{j}f_j(x)\pi_j}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional density is multivariate Gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>\, (x + a)^TA(x+b) = x^TAx + a^TAb + x^TAb + a^TAx </math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\, \Sigma_k </math> is the class covariance matrix and <math>\, \mu_k </math> is the class mean. By definition of the decision boundary (decision boundary between class <math>\ k </math> and class <math>\ l </math>),<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is a linear function of <math>\ x </math> of the form <math>\, x^Ta + b = 0 </math>.<br />
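<br />
Reading off the coefficients from the derivation (with the common covariance <math>\Sigma</math>), the boundary <math>x^Ta + b = 0</math> has <math>a = \Sigma^{-1}(\mu_k - \mu_l)</math> and <math>b = \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log\frac{\pi_k}{\pi_l}</math>. The following Python sketch (not the Matlab used in class) computes them for assumed two-dimensional parameters; the function names and the numerical values are illustrative only.<br />
<br />
```python
import math

def inv_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    """Matrix-vector product for a 2x2 matrix."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_boundary(mu_k, mu_l, sigma, pi_k, pi_l):
    """Return (a, b) such that the decision boundary is x^T a + b = 0."""
    sigma_inv = inv_2x2(sigma)
    a = mat_vec(sigma_inv, [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]])
    diff = [mu_l[0] - mu_k[0], mu_l[1] - mu_k[1]]
    total = [mu_l[0] + mu_k[0], mu_l[1] + mu_k[1]]
    b = 0.5 * dot(diff, mat_vec(sigma_inv, total)) + math.log(pi_k / pi_l)
    return a, b

def side(x, a, b):
    """Positive values favour class k, negative values favour class l."""
    return dot(x, a) + b

# Assumed parameters: two classes with a common covariance and equal priors.
mu_k, mu_l = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]
a, b = lda_boundary(mu_k, mu_l, sigma, 0.5, 0.5)

midpoint = [(mu_k[0] + mu_l[0]) / 2, (mu_k[1] + mu_l[1]) / 2]
print("a =", a, "b =", b)
print("value at midpoint:", side(midpoint, a, b))
```
<br />
With equal priors the midpoint between the two class means lies on the boundary; unequal priors shift the boundary towards the less likely class.<br />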
<br />
=====Example:=====<br />
<br />
%Load data set<br />
load 2_3;<br />
[coefs, scores] = princomp(X');<br />
size(X)<br />
 % ans = 64 400 % 64 dimensions (pixels), 400 images<br />
size(coefs)<br />
% ans = 64 64 <br />
size(scores)<br />
% ans = 400 64<br />
Y=scores(:, 1:2);<br />
% just use two of the 64 principal components<br />
plot(Y(1:200, 1),Y(1:200, 2), 'b.')<br />
hold on<br />
plot(Y(201:400, 1),Y(201:400, 2), 'r.')<br />
<br />
[[File:Pca_2.jpg]]<br />
<br />
 ll=[zeros(200,1);ones(200,1)]; % 400 x 1 label vector: 0 for the 2s, 1 for the 3s<br />
 [C,err,P,logp,coeff] = classify(Y, Y, ll, 'linear');<br />
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|) </math><br />
<br />
<br />
<br />
To classify a point, <math>\ x </math>, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class <math>\ k </math> if <math>\, \delta_k > \delta_l </math> and vice versa. <br /><br /><br />
<br />
:<math><br />
h(x) = \begin{cases}<br />
k, & \text{if } \delta_k > \delta_l \\<br />
l, & otherwise\\<br />
\end{cases}</math><br />
<br />
(note: since <math> - \frac{d}{2}log(2\pi) </math> is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.) <br /><br /><br />
<br />
'''Special Case: <math>\, \Sigma_k = I </math>''', the identity matrix. Then,<br />
<br />
<math> \,\delta_k = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) - \frac{1}{2}log(|I|) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) </math><br />
<br />
We see that in the case <math>\, \Sigma = I </math>, we can simply classify a point, <math>\ x </math>, to a class based on the distances between <math>\ x </math> and the mean of the different classes (adjusted with the log of the prior). <br /><br /><br />
<br />
'''General Case: <math>\, \Sigma_k \ne I </math>''' <br /><br /><br />
<br />
<math> \, \Sigma_k = USV^T = USU^T </math> (since <math>\ \Sigma </math> is symmetric)<br /><br /><br />
<math> \, \Sigma_k^{-1} = (USU^T)^{-1} = (U^T)^{-1}S^{-1}U^{-1} = US^{-1}U^T </math><br /><br /><br />
<br />
So, <math> (x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^T US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-1}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k)^T I(S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k) </math><br />
:<math> \, = (x^* - \mu_k^*)^T I(x^* - \mu_k^*) </math> <br /><br /><br />
<br />
where <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> and <math> \mu_k^* = S^{-\frac{1}{2}}U^T\mu_k </math><br />
<br />
Hence the approach taken should be to transform the point <math>x</math> from the beginning,<br />
<br />
i.e. Let <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> <br /><br /><br />
<br />
Then compute <math> \, \delta_k </math> and <math> \, \delta_l </math> with <math>x^*</math>, similar to the special case above. <br /><br /><br />
<br />
If the prior distributions of the 2 classes are the same, then this method only requires us to find the distances from the point x to the mean of the 2 classes. We would classify x based on the shortest distance to the mean.<br />
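<br />
The transformation <math>x^* = S^{-\frac{1}{2}}U^Tx</math> can be checked numerically: after it, the squared Euclidean distance to a transformed mean equals the quadratic form <math>(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)</math>. The Python sketch below (not the Matlab used in class) assumes a 2 x 2 covariance matrix with a non-zero off-diagonal entry and equal priors; the matrix, the means and the function names are illustrative assumptions.<br />
<br />
```python
import math

def eig_sym_2x2(s):
    """(eigenvalue, unit eigenvector) pairs of a symmetric 2x2 matrix, s[0][1] != 0."""
    a, b, d = s[0][0], s[0][1], s[1][1]
    centre = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (centre + radius, centre - radius):
        norm = math.hypot(b, lam - a)
        pairs.append((lam, (b / norm, (lam - a) / norm)))
    return pairs

sigma = [[1.0, 1.5], [1.5, 3.0]]  # assumed common covariance matrix
eigen_pairs = eig_sym_2x2(sigma)

def whiten(x):
    """x* = S^{-1/2} U^T x, using the eigen-decomposition Sigma = U S U^T."""
    return [(u[0] * x[0] + u[1] * x[1]) / math.sqrt(lam) for lam, u in eigen_pairs]

def euclidean_sq(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mahalanobis_sq(x, mu):
    """(x - mu)^T Sigma^{-1} (x - mu), computed from the explicit 2x2 inverse."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (sigma[1][1] * dx * dx - 2 * sigma[0][1] * dx * dy + sigma[0][0] * dy * dy) / det

def nearest_mean(x, means):
    """Index of the class whose transformed mean is closest (equal priors assumed)."""
    xs = whiten(x)
    return min(range(len(means)), key=lambda k: euclidean_sq(xs, whiten(means[k])))

means = [[1.0, 1.0], [5.0, 3.0]]
x = [2.0, 0.5]
print(euclidean_sq(whiten(x), whiten(means[0])), mahalanobis_sq(x, means[0]))
print("class index:", nearest_mean(x, means))
```
<br />
The two printed distances should agree up to rounding, which is why classification in the transformed space reduces to finding the nearest class mean.<br />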
<br />
In the <math>\delta</math> function calculations above, <math>\pi_k = P(Y = k)</math> and <math>\pi_l = P(Y = l)</math> are the class priors, which can be approximated using the proportions of <math>k</math> and <math>l</math> elements in the training set.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3669stat341 / CM 3612009-07-29T22:58:42Z<p>Hclam: /* Acceptance/Rejection Method */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. A good way to generate random numbers in computational statistics involves analyzing various distributions using computational methods. <s>As a result, the probability distribution of each possible number appears to be uniform</s> (pseudo-random). Outside a computational setting, presenting a uniform distribution is fairly easy (for example, rolling a fair die repeatedly to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''' when it involves both an additive and a multiplicative term; with ''b'' = 0, as in the example, only the multiplicative term remains. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, a paper published in 1988 by Park and Miller found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
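<br />
The recurrence <math>x_{i+1} = (ax_{i} + b) \mod{m}</math> takes only a few lines to implement. The following is a minimal sketch in Python (not the Matlab used in class); the function names <code>lcg</code> and <code>take</code> are illustrative. It reproduces the two small examples worked through above.<br />
<br />
```python
def lcg(a, b, m, seed):
    """Yield x_1, x_2, ... from the recurrence x_{i+1} = (a * x_i + b) mod m."""
    x = seed
    while True:
        x = (a * x + b) % m
        yield x

def take(gen, n):
    """Collect the next n values from a generator."""
    return [next(gen) for _ in range(n)]

# First example in the text: a = 13, b = 0, m = 31, x_0 = 1.
first_three = take(lcg(13, 0, 31, 1), 3)

# Second example: a = 3, b = 2, m = 4, x_0 = 1 never leaves the value 1.
stuck = take(lcg(3, 2, 4, 1), 3)

# Number of distinct values visited by the first generator in m - 1 steps.
distinct = len(set(take(lcg(13, 0, 31, 1), 30)))

# Scale by m - 1 to obtain values between 0 and 1, as described in the text.
uniforms = [x / (31 - 1) for x in take(lcg(13, 0, 31, 1), 5)]

print(first_three, stuck, distinct, uniforms)
```
<br />
The first generator visits every integer from 1 to 30 before repeating, so its period is m - 1, whereas the second has period 1.<br />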
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then the distribution of the random variable X is given by F(x).<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
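:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as <math>Unif[0, 1]</math>, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />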
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
<br />
<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions the inverse of <math>F(x)</math> is difficult or impossible to obtain in closed form. Further, for some distributions even <math>F(x)</math> itself has no closed form expression.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (the intervals are closed on the right, so that the case <math>U = 1</math> is not missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
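<br />
The discrete procedure can be sketched for an arbitrary finite distribution as follows. This is a minimal Python sketch rather than code from the lecture: the function name <code>sample_discrete</code> and the use of the standard-library <code>bisect</code> module are assumptions made for illustration. It returns the first value whose cumulative probability is at least U.<br />

```python
import bisect
import itertools
import random

def sample_discrete(values, probs, rng=random):
    """Draw one value with P(X = values[k]) = probs[k] by inverting the cdf."""
    cum = list(itertools.accumulate(probs))   # running sums p_0, p_0 + p_1, ...
    u = rng.random()                          # U ~ Unif[0, 1)
    k = bisect.bisect_left(cum, u)            # first index k with u <= cum[k]
    return values[min(k, len(values) - 1)]    # guard against round-off in cum[-1]

random.seed(0)
draws = [sample_discrete([0, 1, 2], [0.3, 0.2, 0.5]) for _ in range(20000)]
freq = {v: draws.count(v) / len(draws) for v in (0, 1, 2)}
print(freq)   # empirical frequencies, to compare with 0.3, 0.2, 0.5
```

The cumulative sums are recomputed on every call to keep the sketch short; for repeated sampling they could be computed once and reused.<br />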
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000); % preallocate the output<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2)) % p(1:2) selects the first two probabilities<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
[[Image:edit.JPG|thumb|left|500px|Ali: Some statements are incorrect, inaccurate or misleading. Acceptance-Rejection Method needs to be motivated in more details. ]]<br /><br /><br /><br /><br /><br /><br />
<br />
<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult to sample from directly. <br /><br /><br />
<br />
We can apply the Acceptance-Rejection Method by considering a proposal distribution, <math>g(x)</math>, that is easy to sample from, and satisfies the condition, <br /><br /><br />
<br />
<math>\forall x: f(x) \leq c \cdot g(x)\ </math>, where <math> c \in \Re^+</math><br />
<br /><br /><br />
[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Let <math> Y \sim~ g </math>, i.e. Y is a draw from the proposal distribution. <br /><br /><br />
If we generate many samples of Y and accept each one with probability <math> \frac {f(Y)}{c \cdot g(Y)} </math>, the accepted samples follow f(x), the target distribution.<br />
<br /><br /><br />
We are more likely to reject a sample Y if the ratio above is much less than 1, which happens when <math>f(Y)</math> lies far below <math>cg(Y)</math>. Conversely, we accept the sample point with probability 1 if <math>f(x)</math> and <math>cg(x)</math> agree at the sampled point. This process of accepting and rejecting points yields a sample that follows the target distribution <math>f(x)</math>.<br />
<br />
<br />
<br />
In the graph,<br /><br />
Sampling y = 7 from <math>g(x)</math> gives a point where <math>cg(x)</math> and <math>f(x)</math> agree, so it will be accepted with probability 1.<br />
<br />
Sampling y = 9 from <math>g(x)</math> gives a point where <math>f(x)</math> lies far below <math>cg(x)</math>, so it will be accepted with a low probability.<br />
<br />
'''Proof'''<br />
[[Image:edit.JPG|thumb|left|500px|Ali: Proof of what? . ]]<br /><br /><br /><br /><br /><br /><br />
Show that if points are sampled according to the Acceptance/Rejection method then they follow the target distribution.<br /><br /><br />
<br />
<math> P(X=x|accept) = \frac{P(accept|X=x)P(X=x)}{P(accept)}</math><br /> <br />
by Bayes' theorem<br />
<br /><br /><br />
<math>\begin{align} &P(accept|X=x) = \frac{f(x)}{c \cdot g(x)}\\ &P(X=x) = g(x) \end{align}</math><br /><br />by hypothesis.<br /><br /><br />
<br />
Then,<br /><br />
<math>\begin{align} P(accept) &= \int^{}_x P(accept|X=x)P(X=x) dx \\<br />
&= \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx\\<br />
&= \frac{1}{c} \int^{}_x f(x) dx\\<br />
&= \frac{1}{c} \end{align} </math><br />
<br /><br /><br />
Therefore,<br /><br />
<math> P(X=x|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)}g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
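<br />
The steps above can be sketched generically. This is a minimal Python sketch under stated assumptions: the function name <code>accept_reject</code> and its arguments are hypothetical, and the target <math>f(x) = 3x^2</math> on [0, 1] with a Unif[0, 1] proposal and c = 3 is chosen only for illustration. The observed acceptance rate should be close to <math>1/c</math>, in line with the proof above.<br />

```python
import random

def accept_reject(f, g_pdf, g_sample, c, n, rng=random):
    """Return n draws from density f using a proposal g with f(x) <= c * g(x)."""
    samples, trials = [], 0
    while len(samples) < n:
        y = g_sample()                    # step 1: Y ~ g
        u = rng.random()                  # step 2: U ~ Unif[0, 1)
        trials += 1
        if u <= f(y) / (c * g_pdf(y)):    # step 3: accept with prob f(Y) / (c g(Y))
            samples.append(y)
    return samples, n / trials            # the draws and the observed acceptance rate

# Illustrative target: f(x) = 3x^2 on [0, 1]; proposal Unif[0, 1]; c = max f/g = 3.
random.seed(1)
xs, rate = accept_reject(lambda x: 3 * x * x, lambda x: 1.0, random.random, 3.0, 20000)
print(sum(xs) / len(xs), rate)   # sample mean (theory: 3/4) and rate (theory: 1/3)
```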
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> f(x) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^{2-1}(1-x)^{1-1} = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
 j = 1;<br />
 while j <= 1000 % repeat until 1000 samples have been accepted<br />
   y = rand;<br />
   u = rand;<br />
   if u <= y<br />
     x(j) = y;<br />
     j = j + 1;<br />
   end<br />
 end<br />
 hist(x);<br />
<br />
<br />
The histogram produced here should follow the target density, <math>f(x) = 2x</math>, on the interval <math> 0 \leq x \leq 1 </math>, i.e. the pdf of a Beta(2,1) distribution:<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can take g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
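<br />
The discrete example above can be checked numerically. The following is a minimal Python sketch (not code from the lecture; the names <code>pmf</code> and <code>draw</code> are chosen for illustration) that uses the same pmf, the discrete uniform proposal on {1, 2, 3, 4}, and c = 2.2.<br />

```python
import random

pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}   # target f
g = 1 / len(pmf)                              # discrete uniform proposal, g = 0.25
c = max(p / g for p in pmf.values())          # c = 0.55 / 0.25 = 2.2

def draw(rng=random):
    """Return one draw from pmf by acceptance/rejection."""
    while True:
        y = rng.choice(list(pmf))             # Y ~ Unif{1, 2, 3, 4}
        if rng.random() <= pmf[y] / (c * g):  # accept with prob f(Y) / (c g(Y))
            return y

random.seed(2)
xs = [draw() for _ in range(20000)]
freq = {k: xs.count(k) / len(xs) for k in pmf}
print(freq)   # empirical frequencies, to compare with the pmf
```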
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before obtaining a sample of adequate size from the target distribution. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math>, i.e. mean <math>1/\lambda</math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \sum_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
So if we want to sample from the Gamma distribution, we can draw t independent exponential random variables with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \sum_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far, eg. the inverse is not easy to obtain from F(Z); we may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If x and y are points on the Cartesian plane, r is the length of the radius from a point in the polar plane to the pole, and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}\cdot\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> factors into the product of two density functions, an Exponential and a Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \quad \theta \sim~ Unif[0,2\pi] \end{matrix} </math>, where Exp(1/2) denotes rate 1/2 (mean 2).<br />
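<br />
One way to obtain this joint density is a change of variables. The following worked step is a sketch added for illustration, written in terms of <math>d = r^2</math> and <math>\theta</math>. With <math>x = \sqrt{d}\cos\theta</math> and <math>y = \sqrt{d}\sin\theta</math>, the Jacobian determinant is<br />
:<math>\left| \frac{\partial(x,y)}{\partial(d,\theta)} \right| = \left| \frac{\cos\theta}{2\sqrt{d}} \cdot \sqrt{d}\cos\theta + \sqrt{d}\sin\theta \cdot \frac{\sin\theta}{2\sqrt{d}} \right| = \frac{1}{2}</math><br />
so that<br />
:<math>f(d,\theta) = f(x,y) \cdot \frac{1}{2} = \frac{1}{2\pi}e^{-\frac{d}{2}} \cdot \frac{1}{2}, \quad d > 0,\ 0 \leq \theta < 2\pi</math><br />
which is the product of an exponential density in d with rate 1/2 and a uniform density <math>\frac{1}{2\pi}</math> in θ.<br />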
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independent} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (called u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
In this lecture we will continue to discuss sampling from specific distributions, introduce '''Monte Carlo Integration''', and also talk about the differences between the Bayesian and Frequentist views on probability, along with references to '''Bayesian Inference'''.<br />
<br />
====Binomial Distribution====<br />
A Binomial distribution <math>X \sim~ Bin(n,p) </math> is the sum of <math>n</math> independent Bernoulli trials, each with probability of success <math>p</math> <math>(0 \leq p \leq 1)</math>. For each trial we generate an independent uniform random variable: <math>U_1, \ldots, U_n \sim~ Unif(0,1)</math>. Then X is the number of times that <math>U_i \leq p</math>. In this case if n is large enough, by the central limit theorem, the Normal distribution can be used to approximate a Binomial distribution.<br />
<br />
Sampling from Binomial distribution in Matlab is done using the following code:<br />
n=3;<br />
p=0.5;<br />
trials=1000;<br />
X=sum((rand(trials,n))'<=p);<br />
hist(X)<br />
<br />
Where the histogram is a Binomial distribution, and for higher <math>n</math>, it would resemble a Normal distribution.<br />
<br />
====Monte Carlo Integration====<br />
<br />
Monte Carlo Integration is a numerical method of approximating the evaluation of integrals using random numbers generated from simulations. In this course we will mainly look at three methods for approximating integrals:<br />
# Basic Monte Carlo Integration<br />
# Importance Sampling<br />
# Markov Chain Monte Carlo (MCMC)<br />
<br />
====Bayesian VS Frequentists====<br />
<br />
Over the history of statistics, two major schools of thought have emerged and have long disagreed over which one holds the correct view of probability. These two schools are known as the Bayesian and Frequentist schools of thought. The Bayesians and the Frequentists hold different philosophical views on what defines probability. Below are some fundamental differences between the Bayesian and Frequentist schools of thought:<br />
<br />
'''Frequentist'''<br />
*Probability is '''objective''' and refers to the limit of an event's relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads 50% of the time.<br />
*Parameters are all fixed and unknown constants.<br />
*Any statistical procedure only has interpretations based on limiting frequencies. For example, a 95% C.I. for a given parameter will contain the true value of the parameter 95% of the time over repeated samples.<br />
<br />
'''Bayesian'''<br />
*Probability is '''subjective''' and can be applied to single events based on degree of confidence or beliefs. For example, a Bayesian can refer to tomorrow's weather as having a 50% chance of rain, whereas this would not make sense to a Frequentist because tomorrow is just one unique event, and cannot be referred to as a relative frequency in a large number of trials.<br />
*Parameters are random variables that have a given distribution, and probability statements can be made about them.<br />
*Probability has a distribution over the parameters, and point estimates are usually done by either taking the mode or the mean of the distribution. <br />
<br />
====Bayesian Inference====<br />
<br />
'''Example''':<br />
If we have a screen that only displays single digits from 0 to 9, and this screen is split into a 4x5 matrix of pixels, then all together the 20 pixels that make up the screen can be referred to as <math>\vec{X}</math>, which is our data, and the parameter of the data for this case, which we will refer to as <math> \theta </math>, would be a discrete random variable that can take on the values of 0 to 9. In this example, a Bayesian would be interested in finding <math> Pr(\theta=a|\vec{X}=\vec{x})</math>, whereas a Frequentist would be more interested in finding <math> Pr(\vec{X}=\vec{x}|\theta=a)</math><br />
<br />
=====Bayes' Rule=====<br />
<br />
:<math>f(\theta|X) = \frac{f(X | \theta)\, f(\theta)}{f(X)}.</math><br />
<br />
Note: In this case <math>f (\theta|X)</math> is referred to as '''posterior''', <math>f (X | \theta)</math> as '''likelihood''', <math>f (\theta)</math> as '''prior''', and <math>f (X)</math> as the '''marginal''', where <math>\theta</math> is the parameter and <math>X</math> is the observed variable.<br />
<br />
'''Procedure in Bayesian Inference'''<br />
*First choose a probability distribution as the prior, which represents our beliefs about the parameters.<br />
*Then choose a probability distribution for the likelihood, which represents our beliefs about the data.<br />
*Lastly compute the posterior, which represents an update of our beliefs about the parameters after having observed the data.<br />
<br />
As mentioned before, for a Bayesian, finding point estimates usually involves finding the mode or the mean of the parameter's distribution.<br />
<br />
'''Methods'''<br />
*Mode: <math>\theta = \arg\max_{\theta} f(\theta|X) \gets</math> value of <math>\theta</math> that maximizes <math>f(\theta|X)</math><br />
*Mean: <math> \bar\theta = \int^{}_\theta \theta \cdot f(\theta|X)d\theta</math><br />
<br />
If it is the case that <math>\theta</math> is high-dimensional, and we are only interested in one of the components of <math>\theta</math>, for example, we want <math>\theta_1</math> from <math> \vec{\theta}=(\theta_1,\dots,\theta_n)</math>, then we would have to calculate the integral: <math>\int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_n </math><br />
<br />
This sort of calculation is usually very difficult or not feasible to compute, and thus we would need to do it by simulation.<br />
<br />
'''Note''': <br />
#<math>f(X)=\int^{}_\theta f(X | \theta)f(\theta) d\theta</math> is not a function of <math>\theta</math>, and is called the '''Normalization Factor'''<br />
#Therefore, since <math>f(X)</math> is constant with respect to <math>\theta</math>, the posterior is proportional to the likelihood times the prior: <math>f(\theta|X)\propto f(X | \theta)f(\theta)</math><br />
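<br />
The proportionality above can be illustrated on a grid of parameter values. The following minimal Python sketch uses hypothetical data (7 successes in 10 Bernoulli trials) and a flat prior, both assumptions made only for illustration; it normalizes likelihood times prior numerically and then reads off the posterior mode and mean described under '''Methods'''.<br />

```python
# Hypothetical data: 7 successes in 10 Bernoulli trials, flat prior on theta.
heads, n = 7, 10
grid = [i / 1000 for i in range(1001)]                  # theta values in [0, 1]
prior = [1.0 for _ in grid]                             # flat prior f(theta)
unnorm = [t**heads * (1 - t)**(n - heads) * p           # likelihood * prior
          for t, p in zip(grid, prior)]
z = sum(unnorm)                                         # normalization factor on the grid
post = [w / z for w in unnorm]                          # posterior weights, summing to 1

mode = grid[max(range(len(grid)), key=post.__getitem__)]   # argmax of the posterior
mean = sum(t * w for t, w in zip(grid, post))               # posterior mean
print(mode, mean)
```

For this flat-prior case the exact posterior is Beta(8, 4), so the grid values can be compared with the mode 0.7 and the mean 8/12.<br />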
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentists and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right]</math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\mu</math> is Gaussian (with <math>\sigma</math> known):<br />
:<math>P(\mu) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\mu|X^N) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} \cdot \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\mu|X^N)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
:<math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean <math>\tilde{\mu}</math> generally differs from the maximum likelihood estimate <math>\bar{x}</math>: it is a weighted average of the sample mean and the prior mean.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
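<br />
The two limiting cases noted above can be seen numerically. This is a minimal Python sketch under stated assumptions: the function name <code>posterior_mean</code> is hypothetical, and the simulated data, <math>\sigma = 2</math>, <math>\mu_0 = 0</math> and <math>\tau = 1</math> are illustrative values that do not come from the lecture.<br />

```python
import random

def posterior_mean(xs, sigma, mu0, tau):
    """Posterior mean of mu for N(mu, sigma^2) data and a N(mu0, tau^2) prior."""
    n = len(xs)
    if n == 0:
        return mu0                                       # no data: prior mean
    xbar = sum(xs) / n                                   # sample mean, the MLE of mu
    w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)     # weight placed on the data
    return w * xbar + (1 - w) * mu0

random.seed(3)
data = [random.gauss(5.0, 2.0) for _ in range(10000)]    # hypothetical sample
for n in (0, 5, 100, 10000):
    xs = data[:n]
    mle = sum(xs) / n if n else float("nan")             # MLE is undefined for n = 0
    print(n, mle, posterior_mean(xs, sigma=2.0, mu0=0.0, tau=1.0))
```

As N grows, the weight on the data approaches 1 and the posterior mean approaches the sample mean.<br />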
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm z_{\alpha/2}\cdot SE</math><br />
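<br />
The process and the two statistics above can be sketched as follows. This is a minimal Python sketch, not code from the lecture: the function name <code>mc_integrate</code> is hypothetical, the default <code>z=1.96</code> is the Normal quantile for an approximate 95% interval, and the test integrand <math>\sin(x)</math> on <math>[0, \pi]</math> (exact value 2) is chosen only for illustration.<br />

```python
import math
import random

def mc_integrate(h, a, b, n, z=1.96, rng=random):
    """Estimate the integral of h over [a, b]; return (estimate, SE, CI)."""
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]   # Y_i = w(x_i)
    est = sum(ys) / n                                         # I-hat
    s2 = sum((y - est) ** 2 for y in ys) / (n - 1)            # sample variance s^2
    se = math.sqrt(s2 / n)                                    # SE = s / sqrt(N)
    return est, se, (est - z * se, est + z * se)              # approximate CI

random.seed(4)
est, se, ci = mc_integrate(math.sin, 0.0, math.pi, 50000)     # exact value is 2
print(est, se, ci)
```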
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
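# We need to draw <math>x_1, \dots, x_N</math> from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math> (an Exp(1) distribution, which can be sampled with the inverse transform <math>x_i = -\log(u_i)</math>)<br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />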
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 u=rand(100,1);<br />
 x=-log(u);<br />
 h=x.^0.5;<br />
 mean(h)<br />
 %The value obtained using the Monte Carlo method<br />
 F = @(x) sqrt(x).*exp(-x);<br />
 quad(F,0,50)<br />
 %The value obtained by numerical integration in Matlab, for comparison<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# The exact value is <math> \displaystyle I = \left. \frac{x^4}{4} \right|_0^1 = \frac{1}{4} </math><br />
# <math>\displaystyle w(x) = x^3(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
 x = rand(1000,1);<br />
 mean(x.^3)<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite range,<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math>, and <math>x=5</math> maps to <math>y=1</math> while <math>x \to \infty</math> maps to <math>y \to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
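<br />
The substitution in Example 3 can be checked numerically. This is a minimal Python sketch added for illustration: it integrates over <math>y \in (0, 1]</math> with the positive Jacobian factor <math>1/y^2</math> (the limits oriented from 0 to 1), and compares the estimate with the closed form <math>\int_5^\infty 3e^{-x}\,dx = 3e^{-5}</math>.<br />

```python
import math
import random

def w(y):
    """Integrand after substituting x = 1/y + 4, with Jacobian factor 1 / y^2."""
    return 3.0 * math.exp(-(1.0 / y + 4.0)) / (y * y)

random.seed(5)
n = 100000
# 1 - random() lies in (0, 1], which avoids dividing by zero at y = 0.
est = sum(w(1.0 - random.random()) for _ in range(n)) / n
exact = 3.0 * math.exp(-5.0)          # closed form of the original integral
print(est, exact)
```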
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be flat (uniform), and the posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find the posterior distributions of <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of <math>\displaystyle p_1</math> is proportional to the Binomial likelihood <math>\displaystyle p_1^{x}(1-p_1)^{n-x}</math> viewed as a function of <math>\displaystyle p_1</math>, which is the kernel of a Beta distribution (and similarly for <math>\displaystyle p_2</math>). Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> for each sample in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
 mean(delta)<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\ \text{CI} \approx (0.06606, 0.32204) </math>.<br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the lowest dose at which the animals have at least a 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta</math> as the smallest dose <math>\displaystyle x_i</math> whose sampled <math>\displaystyle p_i</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
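The Monte Carlo process above can be sketched as follows. This is only a minimal sketch in Python (standard library), not code from the lecture: the function name, the three dose levels and the death counts are hypothetical, and we assume that draws in which no dose reaches a 50% death probability are simply discarded.<br />

```python
import random

def estimate_threshold_dose(doses, deaths, n_rats, n_samples, seed=0):
    """Monte Carlo estimate of the lowest dose whose death probability is >= 0.5."""
    rng = random.Random(seed)
    deltas = []
    while len(deltas) < n_samples:
        # Step 1: draw each p_i from Beta(y_i + 1, n - y_i + 1).
        p = [rng.betavariate(y + 1, n_rats - y + 1) for y in deaths]
        # Step 2: keep the draw only if p_1 <= p_2 <= ... holds.
        if any(a > b for a, b in zip(p, p[1:])):
            continue
        # Step 3: delta is the first dose whose sampled p_i is at least 0.5.
        # Assumption: a draw where no dose reaches 0.5 is discarded.
        hits = [x for x, p_i in zip(doses, p) if p_i >= 0.5]
        if hits:
            deltas.append(hits[0])
    # Steps 4-5: average the accepted values of delta.
    return sum(deltas) / len(deltas)

# Hypothetical data: 3 dose levels, 10 rats per dose, deaths observed per dose.
doses = [1.0, 2.0, 3.0]
deaths = [1, 5, 9]
delta_bar = estimate_threshold_dose(doses, deaths, n_rats=10, n_samples=1000)
print(delta_bar)
```

With ten dose levels the ordering constraint rejects many more draws, so the loop may need far more iterations than in this three-dose illustration.<br />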
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As in the case of the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The main idea in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target function. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, such that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
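The process above can be sketched in a few lines. This is a hedged illustration in Python (standard library), with choices that are our own assumptions rather than part of the lecture: <math>f</math> is the N(0,1) density, <math>h(x) = x^2</math> (so the true value is <math>I = 1</math>), and the proposal <math>g</math> is the N(0, 4) density. The helper names are hypothetical.<br />

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def importance_sampling(h, f_pdf, g_pdf, g_sampler, n):
    """Estimate I = integral of h(x) f(x) dx using n draws from g."""
    total = 0.0
    for _ in range(n):
        x = g_sampler()                       # x_i ~ g
        total += h(x) * f_pdf(x) / g_pdf(x)   # w(x_i) = h(x_i) f(x_i) / g(x_i)
    return total / n

rng = random.Random(1)
estimate = importance_sampling(
    h=lambda x: x * x,
    f_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    g_pdf=lambda x: normal_pdf(x, 0.0, 2.0),   # assumed proposal: N(0, 4)
    g_sampler=lambda: rng.gauss(0.0, 2.0),
    n=100000,
)
print(estimate)
```

The proposal here is deliberately wider than <math>f</math>, so the ratio <math>f/g</math> stays bounded.<br />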
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because each value sampled from <math>\displaystyle g(x)</math> is given an 'importance' or 'weighting' <math>\displaystyle \frac{f(x)}{g(x)}</math>: values that are more likely under the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), than under <math>\displaystyle g(x)</math> receive a higher weight.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math>, which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, taking each sampled value <math>\displaystyle h(x_i)</math> and multiplying it by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that the points where this ratio is high count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of <math> h(x_i) </math> instead of a plain average.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math> We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x) </math> does not, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, since then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can become arbitrarily large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
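As an illustration of Remark 1 (a worked case under assumed choices, not taken from the lecture), let <math>\displaystyle h(x) = 1</math>, let <math>\displaystyle f</math> be the <math>\displaystyle N(0,1)</math> density and let <math>\displaystyle g</math> be the <math>\displaystyle N(0,\sigma^2)</math> density. Then<br />
:<math>\displaystyle E_g[(w(x))^2] = \int \frac{f^2(x)}{g(x)}\,dx = \frac{\sigma}{\sqrt{2\pi}}\int \exp\left(-x^2\left(1-\frac{1}{2\sigma^2}\right)\right)dx</math><br />
This integral is finite only when <math>\displaystyle 1-\frac{1}{2\sigma^2} > 0</math>, i.e. <math>\displaystyle \sigma^2 > \frac{1}{2}</math>, in which case it equals <math>\displaystyle \frac{\sigma^2}{\sqrt{2\sigma^2-1}}</math>. For <math>\displaystyle \sigma^2 \leq \frac{1}{2}</math> the tails of <math>\displaystyle g</math> are too thin relative to <math>\displaystyle f</math> and the second moment is infinite.<br />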
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
::::::::::'''Aside: Jensen's Inequality'''<br />
::<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>), then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math> for all <math>\displaystyle 0 \leq \alpha \leq 1</math>. (In this aside <math>\displaystyle g</math> denotes a generic convex function, not the proposal density.)<br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math> and each <math>\displaystyle \alpha_i \geq 0</math><br />
::::::::::=======================================================<br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... + p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
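Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E_g[(w(x))^2]</math>, we find that <math>\displaystyle (w(x))^2 = \left(\int \left| h(s) \right| f(s) ds\right)^2</math> wherever <math>\displaystyle h(x) \neq 0</math>, so the lower bound is attained, as required. Details omitted.<br /><br />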
<br />
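However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, because its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />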
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
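A small numerical check of the theorem, given as a sketch in Python (standard library) under assumed choices that are not from the lecture: <math>f</math> is the Exp(1) density and <math>h(x) = x</math>, so <math>I = 1</math> and <math>g^*(x) = x e^{-x}</math>, which is the Gamma(2, 1) density. Every weight should then equal <math>I</math>, so the sample variance of the weights is zero up to rounding.<br />

```python
import math
import random

def f_pdf(x):
    """Exp(1) density (assumed target for this illustration)."""
    return math.exp(-x)

def h(x):
    return x

def g_star_pdf(x):
    """|h(x)| f(x) / integral of |h| f = x exp(-x), the Gamma(2, 1) density."""
    return x * math.exp(-x)

rng = random.Random(2)
weights = []
for _ in range(1000):
    x = rng.gammavariate(2.0, 1.0)            # x_i ~ g*
    weights.append(h(x) * f_pdf(x) / g_star_pdf(x))

mean_w = sum(weights) / len(weights)
var_w = sum((w - mean_w) ** 2 for w in weights) / (len(weights) - 1)
print(mean_w, var_w)
```

This only works here because the normalizing constant of <math>g^*</math> happens to be known in closed form.<br />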
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_1,...,x_N \sim~ g</math><br />
Since <math>\displaystyle f</math> is a density, <math>\displaystyle \int f(x)\,dx = 1</math>, so we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\, dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\, dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
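The normalized-weight form can be sketched as follows. This is a hedged illustration in Python (standard library) with assumed inputs: the target is known only as the unnormalized function <math>e^{-x^2/2}</math> (an N(0,1) density without its constant), <math>h(x) = x^2</math> (true value 1), and the proposal is N(0, 4). The function names are hypothetical.<br />

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def self_normalized_is(h, f_unnorm, g_pdf, g_sampler, n):
    """Estimate E_f[h(X)] when f is known only up to a constant."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(n):
        x = g_sampler()               # x_i ~ g
        b = f_unnorm(x) / g_pdf(x)    # unnormalized weight b(x_i)
        numerator += h(x) * b
        denominator += b
    # Dividing by the sum of the weights cancels the unknown constant of f.
    return numerator / denominator

rng = random.Random(3)
estimate = self_normalized_is(
    h=lambda x: x * x,
    f_unnorm=lambda x: math.exp(-0.5 * x * x),   # N(0,1) up to a constant
    g_pdf=lambda x: normal_pdf(x, 0.0, 2.0),     # assumed proposal: N(0, 4)
    g_sampler=lambda: rng.gauss(0.0, 2.0),
    n=100000,
)
print(estimate)
```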
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to count the fraction of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is a low probability event (a tail event), and different runs may give significantly different values. For example, if we generate 1000 samples, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we repeat the experiment 100 times, generating 100 samples in each run):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x) \sim~ N(4,1) </math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math>, which is very close to 0 and much less than the variance observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of a Markov chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:'''<math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Index Set:'''<math>\displaystyle T </math> is the set from which t takes its values; it could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{n+1}=j\mid X_n= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way, e.g. by a coin toss. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. With the states numbered from left to right, the transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\ddots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
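The random walk with stopping at both ends can be written down programmatically. The following is a minimal sketch in Python (standard library); the function name is hypothetical, and we assume the states are numbered from left to right so that a step to the right goes from state <math>i</math> to state <math>i+1</math> with probability <math>p</math>.<br />

```python
def random_walk_matrix(n_states, p):
    """Transition matrix of a random walk that stops at the first and last state."""
    q = 1.0 - p
    P = [[0.0] * n_states for _ in range(n_states)]
    P[0][0] = 1.0                            # first state: stay forever
    P[n_states - 1][n_states - 1] = 1.0      # last state: stay forever
    for i in range(1, n_states - 1):
        P[i][i - 1] = q                      # step to the left
        P[i][i + 1] = p                      # step to the right
    return P

P = random_walk_matrix(5, 0.3)
for row in P:
    print(row)
```

Each row is a probability distribution over the next state, which is a convenient sanity check for any hand-written transition matrix.<br />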
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a set of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a sample average can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
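To illustrate the estimator, here is a hedged sketch in Python (standard library) that simulates a hypothetical two-state chain and averages <math>h</math> along the path. The transition matrix, the choice <math>h(x) = x</math> and the function name are assumptions for illustration only. For this matrix the stationary distribution is <math>(5/6, 1/6)</math>, so the average should approach <math>1/6</math>.<br />

```python
import random

def ergodic_average(P, h, x0, n_steps, seed=0):
    """Average of h along one simulated path of the chain with transition matrix P."""
    rng = random.Random(seed)
    x = x0
    total = 0.0
    for _ in range(n_steps):
        # Sample the next state from row x of P by inverting the cumulative sums.
        u = rng.random()
        cumulative = 0.0
        for j, p_xj in enumerate(P[x]):
            cumulative += p_xj
            if u < cumulative:
                x = j
                break
        total += h(x)
    return total / n_steps

# Hypothetical two-state chain on the states {0, 1}.
P = [[0.9, 0.1],
     [0.5, 0.5]]
estimate = ergodic_average(P, h=lambda x: float(x), x0=0, n_steps=100000)
print(estimate)
```

Unlike plain Monte Carlo, successive samples here are dependent, yet the average still settles near the stationary expectation.<br />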
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1 over all <math>j</math>, as the row must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
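The two theorems can be checked numerically. The sketch below, in Python (standard library), uses a hypothetical two-state transition matrix and initial distribution, and compares stepping <math>\mu_n = \mu_{n-1}P</math> three times against computing <math>\mu_0P^3</math> directly. The helper names are our own.<br />

```python
def vec_mat(mu, P):
    """Row vector times matrix: (mu P)_j = sum_i mu_i P_ij."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]

def mat_mul(A, B):
    """Matrix product, computed row by row."""
    return [vec_mat(row, B) for row in A]

def mat_pow(P, n):
    """n-step transition matrix P(n) = P^n, by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

# Hypothetical chain and initial distribution.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu0 = [1.0, 0.0]

mu_stepwise = mu0
for _ in range(3):
    mu_stepwise = vec_mat(mu_stepwise, P)   # mu_n = mu_{n-1} P

mu_direct = vec_mat(mu0, mat_pow(P, 3))     # mu_3 = mu_0 P^3
print(mu_stepwise, mu_direct)
```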
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
----<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A Markov chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that the state space of a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it reduces to a single class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only move to each other or stay where they are (the only nonzero entries in rows 1 and 2 are <math>\displaystyle P_{11}, P_{12}, P_{21}</math> and <math>\displaystyle P_{22}</math>)<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
::<br />
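The decomposition of this example can be reproduced mechanically. The following is a sketch in Python (standard library), not the lecture's method: it computes which states reach which by a transitive closure over the nonzero entries of <math>P</math>, groups states that reach each other, and checks whether each class is closed. The function names are hypothetical, and states are indexed from 0 internally.<br />

```python
def communicating_classes(P):
    """Group states into classes of mutually reachable states."""
    n = len(P)
    # reach[i][j]: j can be reached from i in zero or more steps.
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):                      # transitive closure
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    classes = []
    assigned = set()
    for i in range(n):
        if i in assigned:
            continue
        cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
        assigned.update(cls)
        classes.append(cls)
    return classes

def is_closed(cls, P):
    """A class is closed if no transition leaves it."""
    return all(P[i][j] == 0 for i in cls for j in range(len(P)) if j not in cls)

P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]
classes = communicating_classes(P)
for cls in classes:
    label = "closed" if is_closed(cls, P) else "not closed"
    print([s + 1 for s in cls], label)
```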
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called ''recurrent'' or ''persistent'' if<br />
:<math>\displaystyle \Pr(X_{n}=i \text{ for some } n\geq 1 \mid X_{0}=i)=1 </math><br />
That means that the probability of coming back to state i, starting from state i, is 1.<br />
<br />
If this is not the case (i.e. the probability is less than 1), then state i is ''transient''.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is ''recurrent'' if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is ''transient'' if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is ''recurrent'' and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is ''recurrent''<br />
*If <math>\displaystyle i</math> is ''transient'' and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is ''transient''<br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent, then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. A return to <math>\displaystyle 0</math> is only possible after an even number of steps 2n, of which exactly n are to the right and n are to the left.<br />
<math>\displaystyle Pr(X_{2n}=0 \mid X_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the sum becomes (returns after an odd number of steps have probability zero):<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle 4pq=4p(1-p)\leq 1</math>, with equality only when <math>\displaystyle p=\frac{1}{2}</math>, we can conclude that if <math>\displaystyle 4pq < 1</math> the series converges and the state is transient; otherwise the terms decay only like <math>\displaystyle 1/\sqrt{n\pi}</math>, the series diverges, and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n) \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
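<br />
As a rough numerical check of the argument above, the following Python sketch sums the return probabilities <math>P_{00}(2n)</math> directly. The function name <code>return_prob_sum</code> and the choices <math>p=0.6</math> and the numbers of terms are assumptions made only for illustration; they are not part of the lecture.<br />

```python
from math import sqrt

def return_prob_sum(p, n_terms):
    """Partial sum of P_00(2n) = C(2n, n) p^n q^n for n = 0, ..., n_terms - 1."""
    q = 1.0 - p
    total = 0.0
    term = 1.0  # the n = 0 term: C(0, 0) p^0 q^0 = 1
    for n in range(n_terms):
        total += term
        # C(2n+2, n+1) / C(2n, n) = 2(2n+1)/(n+1), so update the term recursively
        term *= 2.0 * (2 * n + 1) / (n + 1) * p * q
    return total

# Biased walk (p = 0.6): the series converges to 1/sqrt(1 - 4pq)
biased = return_prob_sum(0.6, 2000)
closed_form = 1.0 / sqrt(1.0 - 4 * 0.6 * 0.4)

# Symmetric walk (p = 0.5): the partial sums keep growing, roughly like sqrt(N)
sym_small = return_prob_sum(0.5, 1000)
sym_large = return_prob_sum(0.5, 100000)

print(biased, closed_form)
print(sym_small, sym_large)
```

For <math>p \neq \frac{1}{2}</math> the partial sums settle at the closed form (transient case), while for <math>p = \frac{1}{2}</math> they grow without bound (recurrent case).<br />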
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from the state i to the state j. It is also possible that the state j is not reachable from the state i.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n\geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists (given } x_{0}=i\text{)} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} n f_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive<br />
<br />
====Periodic and aperiodic Markov chain====<br />
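A state is called <math>\emph{periodic}</math> of period <math>\displaystyle d > 1</math> if, starting from that state, the chain can return to it only at multiples of <math>\displaystyle d</math> steps; formally, <math>\displaystyle d</math> is the greatest common divisor of all <math>\displaystyle n</math> with <math>\displaystyle P_{ii}(n) > 0</math>. A Markov chain is called periodic if its states are periodic.<br />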
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: starting from state 1, we can only be back in state 1 after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
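<br />
The periods of the two example chains can be checked numerically. The Python sketch below takes the period of a state to be the greatest common divisor of all step counts <math>n</math> at which a return has positive probability; the helper names <code>mat_mul</code> and <code>period</code> and the cutoff <code>max_steps</code> are illustrative assumptions, not part of the lecture.<br />

```python
from math import gcd

def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def period(P, state, max_steps=50):
    """gcd of all n <= max_steps with P^n[state][state] > 0 (0 if no return is seen)."""
    d = 0
    Pn = [row[:] for row in P]  # Pn holds P^n, starting with n = 1
    for n in range(1, max_steps + 1):
        if Pn[state][state] > 0:
            d = gcd(d, n)
        Pn = mat_mul(Pn, P)
    return d

three_cycle = [[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]]
four_state = [[0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0],
              [0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0]]

print(period(three_cycle, 0))  # period of state 1 in the three-state chain
print(period(four_state, 0))   # period of state 1 in the four-state chain
```

A chain with a self-loop of positive probability, such as a two-state chain with all entries equal to 0.5, gets period 1 from the same function, i.e. it is aperiodic.<br />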
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required, since each row of <math>P</math> sums to 1.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
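<br />
The warning can be illustrated with a few lines of Python: the uniform vector is left unchanged by <math>P</math>, yet the powers <math>P^n</math> cycle with period 3 instead of settling down. The helper names <code>vec_mat</code> and <code>mat_mul</code> are assumptions made for this sketch.<br />

```python
def vec_mat(v, P):
    """Row vector times matrix: (vP)_j = sum_i v_i P_ij."""
    n = len(P)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]

def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
pi = [1 / 3, 1 / 3, 1 / 3]

pi_next = vec_mat(pi, P)   # should equal pi, so pi is stationary

powers = [P]               # powers[k] holds P^(k+1)
for _ in range(5):
    powers.append(mat_mul(powers[-1], P))

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(pi_next)
print(powers[2] == identity, powers[3] == P)
```

Because <math>P^3</math> is the identity and <math>P^4 = P</math>, the sequence <math>P^n</math> never approaches a matrix with identical rows, so no limiting distribution exists.<br />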
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \ 0 \ 0.5]</math> is the stationary distribution of a chain, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^N g(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
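<br />
The result can be cross-checked by repeatedly multiplying an arbitrary starting distribution by <math>P</math>. This Python sketch assumes a starting vector of <math>(1, 0, 0)</math> and 200 iterations, both chosen only for illustration; the helper name <code>step</code> is hypothetical.<br />

```python
def step(pi, P):
    """One step of the chain on distributions: returns pi P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[1 / 2, 1 / 2, 0],
     [1 / 2, 1 / 4, 1 / 4],
     [0, 1 / 3, 2 / 3]]

pi = [1.0, 0.0, 0.0]       # arbitrary starting distribution (assumed)
for _ in range(200):
    pi = step(pi, P)

expected = [4 / 11, 4 / 11, 3 / 11]
print(pi)
print(expected)
```

The iterates approach <math>(4/11, 4/11, 3/11)</math> regardless of the starting vector, which is what the theorem on irreducible ergodic chains predicts.<br />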
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from <math>f(x)</math>, the 'target' distribution. Choose <math>q(y | x)</math>, the 'proposal' distribution that is easily sampled from.<br /><br /><br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>; this is the starting point of the chain. Choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br />2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br />4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math> (the starting point <math>X_0</math> itself is chosen randomly).<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of the Metropolis-Hastings algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the probability density of setting the mean to x and sampling y is equal to that of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> usually lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> would be very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: the shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
:'''If b is just right'''<br />
::Well chosen b will help avoid the issues mentioned above and we can say that the chain is "mixing well".<br />
<br />
<br />
<gallery widths="250px" heights="250px"><br />
Image:Blarge.jpg|B too large (B=1000)<br />
Image:Bsmall.JPG|B too small (B=0.001)<br />
Image:Bgood.JPG|Good choice of B (B=2)<br />
</gallery><br />
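<br />
The effect of b can also be seen from the acceptance rate alone. The following Python sketch re-implements the Cauchy example with a fixed random seed and reports the fraction of accepted proposals for the three values of b shown in the figures; the function name <code>metropolis_cauchy</code>, the seed and the chain length of 20000 are assumptions made for illustration.<br />

```python
import random

def metropolis_cauchy(b, n_steps, seed=0):
    """Random-walk Metropolis for the Cauchy target with a N(x, b^2) proposal.

    Returns the chain and the fraction of proposals that were accepted.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)            # random starting point
    chain = [x]
    accepted = 0
    for _ in range(n_steps - 1):
        y = x + b * rng.gauss(0.0, 1.0)                # proposal centred at x
        r = min((1.0 + x * x) / (1.0 + y * y), 1.0)    # f(y)/f(x) for the Cauchy
        if rng.random() < r:
            x = y
            accepted += 1
        chain.append(x)
    return chain, accepted / (n_steps - 1)

rates = {b: metropolis_cauchy(b, 20000)[1] for b in (0.001, 2.0, 1000.0)}
for b, rate in rates.items():
    print(b, rate)
```

With a very small b almost every proposal is accepted but the chain barely moves, while with a very large b almost every proposal is rejected; a moderate b sits between the two extremes.<br />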
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We talked about <math>\emph{discrete}</math> MC so far. <br />
<br />
<br />We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
That is, we start from the right-hand side of the stationarity condition and show that it equals <math>\displaystyle f(x)</math>:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> (by detailed balance)<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})\cdot r(x,y)=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br /><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
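<br />
The same argument can be verified exactly on a small discrete example. The Python sketch below builds the Metropolis-Hastings transition matrix for a four-state target and a non-symmetric proposal, using exact fractions; the particular target <code>f</code> and proposal <code>q</code> are made-up values chosen only for illustration.<br />

```python
from fractions import Fraction as F

# Target pmf on states 0..3 (assumed values)
f = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]

# Proposal: q[x][y] = q(y | x); rows sum to 1 and the matrix is not symmetric
q = [
    [F(1, 2), F(1, 4), F(1, 8), F(1, 8)],
    [F(1, 3), F(1, 3), F(1, 6), F(1, 6)],
    [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
    [F(1, 10), F(2, 10), F(3, 10), F(4, 10)],
]

n = len(f)
P = [[F(0)] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if x != y:
            # r(x, y) = min{ f(y) q(x|y) / (f(x) q(y|x)), 1 }
            r = min(f[y] * q[y][x] / (f[x] * q[x][y]), F(1))
            P[x][y] = q[x][y] * r      # propose y, then accept it
    P[x][x] = 1 - sum(P[x])            # remaining mass: stay at x

balanced = all(f[x] * P[x][y] == f[y] * P[y][x]
               for x in range(n) for y in range(n))
f_times_P = [sum(f[x] * P[x][y] for x in range(n)) for y in range(n)]
stationary = f_times_P == f

print(balanced, stationary)
```

Both checks hold exactly, which mirrors the two steps of the proof: detailed balance for every pair of states, and hence <math>f = fP</math>.<br />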
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to help sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (that is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although less applicable at times.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br />3. Accept <math>\displaystyle Y</math> with probability <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math> (it is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known, but the distribution is not exactly known)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r(x_i,y) \\<br />
x_i, & \text{with probability } 1-r(x_i,y) \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
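<br />
A small simulation makes the dependence visible: whenever a proposal is rejected, the chain repeats its previous value. The Python sketch below assumes a standard normal target and a fixed <math>N(0, 2^2)</math> proposal, neither of which comes from the lecture, and counts how often consecutive states coincide.<br />

```python
import math
import random

def independence_sampler(n_steps, seed=0):
    """Independence Metropolis-Hastings: target N(0, 1), proposal N(0, 2^2)."""
    rng = random.Random(seed)

    def weight(x):
        # f(x)/q(x) up to a constant: exp(-x^2/2) / exp(-x^2/8)
        return math.exp(-0.5 * x * x + x * x / 8.0)

    x = 0.0
    chain, repeats = [x], 0
    for _ in range(n_steps - 1):
        y = rng.gauss(0.0, 2.0)               # proposal ignores the current state
        r = min(weight(y) / weight(x), 1.0)   # but r still depends on x
        if rng.random() < r:
            x = y
        else:
            repeats += 1                      # rejected: the state is repeated
        chain.append(x)
    return chain, repeats / (n_steps - 1)

chain, repeat_fraction = independence_sampler(50000)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(repeat_fraction, mean, var)
```

A sizeable fraction of the steps repeat the previous state, so consecutive values are clearly dependent, yet the sample mean and variance are still close to those of the target.<br />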
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
Simulated Annealing is a method of optimization and an application of the Metropolis Hastings Algorithm.<br />
<br />
Consider the problem where we want to find <math>x</math> such that the objective function <math>h(x)</math> is at its minimum,<br /><br /><br />
<br />
<math>\ \min_{x}(h(x)) </math><br /><br /><br />
<br />
Given a constant <math>T > 0</math> and since the exponential function is monotone increasing, this optimization problem is equivalent to,<br /><br /><br />
<br />
<math>\ \max_{x}(e^{\frac{-h(x)}{T}})</math> <br /><br /><br />
<br />
We consider a distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br /><br />IE. <math>X_{i+1} = Y</math> if <math>U<r</math><br /><br /><br />
:#Set <math>\displaystyle i = i+1</math>, decrease T and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, which is close to 1 when T is large, and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math> (uphill moves are still accepted often)<br />
<br />
<b>On the other hand, suppose T is arbitrarily small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = 0 </math> and we therefore reject <math>\displaystyle y </math><br />
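<br />
To put numbers on the two regimes, this Python sketch evaluates the acceptance probability for a move that raises the objective by 1 at three temperatures. The function name <code>acceptance_probability</code> and the values of <math>h(x)</math>, <math>h(y)</math> and T are assumptions made for illustration.<br />

```python
import math

def acceptance_probability(h_x, h_y, T):
    """r = min{exp((h(x) - h(y)) / T), 1} for a symmetric proposal."""
    if h_y <= h_x:
        return 1.0                       # downhill (or flat) move: always accept
    return math.exp((h_x - h_y) / T)     # uphill move: accept with probability < 1

# Uphill move from h(x) = 1 to h(y) = 2 at a high, medium and low temperature
uphill = {T: acceptance_probability(1.0, 2.0, T) for T in (100.0, 1.0, 0.01)}
# Downhill move from h(x) = 2 to h(y) = 1 at a low temperature
downhill = acceptance_probability(2.0, 1.0, 0.01)

for T, r in uphill.items():
    print(T, r)
print(downhill)
```

At a high temperature the uphill move is accepted almost as readily as a downhill one, while at a low temperature it is essentially never accepted, so the chain behaves more and more like a pure descent method as T shrinks.<br />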
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it makes a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{-\frac{f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
<gallery widths="250px" heights="250px"><br />
Image:ezplotf1.jpg|T = 100<br />
Image:ezplotf2.jpg|T = 0.1<br />
</gallery><br />
<br />
<br />
In the end, T is small and the region we are trying to sample from becomes sharper. The points that we accept are increasingly close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
In practice the algorithm may get 'stuck' in another local minimum nearby when T is too small, and we don't get the convergence we are looking for.<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
% obj.m - the objective function, saved as its own file<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
% main script<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); <br />
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = \{9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001\}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
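<br />
The whole cooling schedule can be run in one go. The following Python sketch mirrors the MATLAB example (objective <math>(x-2)^2</math>, proposal <math>N(x, b^2)</math> with b = 2, 100 steps per temperature); the function names, the fixed seed and the way the schedule is passed as a list are assumptions made for illustration.<br />

```python
import math
import random

def objective(x):
    return (x - 2.0) ** 2

def simulated_annealing(temperatures, steps_per_temperature=100, b=2.0,
                        x0=0.0, seed=0):
    """Run the annealing chain through a decreasing list of temperatures."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for T in temperatures:
        for _ in range(steps_per_temperature):
            y = x + b * rng.gauss(0.0, 1.0)           # symmetric proposal
            delta = objective(y) - objective(x)
            # accept downhill moves always, uphill moves with prob. exp(-delta/T)
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                x = y
            path.append(x)
    return path

schedule = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
path = simulated_annealing(schedule)
x_final = path[-1]
print(x_final)
```

Early in the run the chain wanders widely; as T decreases it is confined to a narrower and narrower region around the minimizer x = 2.<br />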
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the one given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively strong correlations between the random variables, because each conditional update can then move only a short distance.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle <br />
<br />
x_1^* \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2^* \sim~ f(x_2|x_1^*, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n^* \sim~ f(x_n|x_1^*, x_2^*, ... , x_{n-1}^*)<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1^*, x_2^*, \ldots, x_n^*) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>; otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; the initial stretch of the chain is called the "burn-in" period (also referred to as the "burning time"). For this reason the distribution is sampled better by keeping only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution with mean <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and covariance matrix <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle X_2 | X_1 = x_1 \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle X_1 | X_2 = x_2 \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
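Y = [1 ; 2];                 % mean vector [mu_1; mu_2]<br />
rho = 0.01;                  % correlation between the two components<br />
sigma = sqrt(1 - rho^2);     % conditional standard deviation<br />
X = zeros(5000,2);           % preallocate; row i holds the i-th sample<br />
X(1,:) = [0 0];              % initial point<br />
<br />
for i = 2:5000<br />
% draw x_1 given the previous x_2<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2));<br />
X(i,1) = mu + sigma*randn;<br />
% draw x_2 given the x_1 just drawn (not the previous one)<br />
mu = Y(2) + rho*(X(i,1) - Y(1));<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4001 points -> <br />
% this demonstrates that convergence occurs after a while <br />
% (the initial stretch is called the burn-in period, or "burning time")<br />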
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
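<br />
As a rough sanity check on the sampler described above, the following sketch (not from the lecture) runs the same two conditional updates with a larger correlation, discards a burn-in period, and compares the sample mean and sample correlation with the target values. The names mu0, nIter and burn, and the value rho = 0.8, are assumptions chosen only for illustration.<br />

```matlab
% Gibbs sampler for a bivariate normal with unit variances (illustrative sketch)
mu0   = [1; 2];            % target mean, as in the example above
rho   = 0.8;               % assumed correlation, chosen only for illustration
sd    = sqrt(1 - rho^2);   % conditional standard deviation
nIter = 20000;             % number of Gibbs sweeps
burn  = 1000;              % samples discarded as burn-in

X = zeros(nIter, 2);       % row i holds the i-th sample (x_1, x_2)
for i = 2:nIter
    % x_1 | x_2 ~ N(mu_1 + rho*(x_2 - mu_2), 1 - rho^2)
    X(i,1) = mu0(1) + rho*(X(i-1,2) - mu0(2)) + sd*randn;
    % x_2 | x_1 ~ N(mu_2 + rho*(x_1 - mu_1), 1 - rho^2), using the new x_1
    X(i,2) = mu0(2) + rho*(X(i,1) - mu0(1)) + sd*randn;
end

kept       = X(burn+1:end, :);   % keep only post-burn-in samples
sampleMean = mean(kept);         % should be close to [1 2]
C          = corrcoef(kept);     % 2x2 sample correlation matrix
sampleRho  = C(1,2);             % should be close to rho
```

With a correlation this large the chain moves in shorter steps, so more sweeps are needed than in the nearly independent case rho = 0.01.<br />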
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some random variables <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
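<br />
The procedure above can be sketched in MATLAB as follows. This is only an illustration under stated assumptions: the target f is taken to be an unnormalized zero-mean bivariate normal with correlation 0.5, and both proposals are symmetric random walks (so the proposal ratio equals 1 and drops out of r). The names f, step, nIter and burn are hypothetical.<br />

```matlab
% Metropolis-Hastings within Gibbs (illustrative sketch)
rho = 0.5;                                   % assumed target correlation
% unnormalized target density f(x,y): zero-mean bivariate normal
f = @(x, y) exp(-(x.^2 - 2*rho*x.*y + y.^2) / (2*(1 - rho^2)));
step  = 1.0;                                 % assumed proposal standard deviation
nIter = 20000;
X = zeros(nIter, 1);  Y = zeros(nIter, 1);   % chain starts at (0,0)

for n = 1:nIter-1
    % MH step for X with Y fixed; symmetric proposal, so the q-ratio is 1
    Z = X(n) + step*randn;
    r = min(1, f(Z, Y(n)) / f(X(n), Y(n)));
    if rand < r
        X(n+1) = Z;
    else
        X(n+1) = X(n);
    end
    % MH step for Y with the new X fixed
    Z = Y(n) + step*randn;
    r = min(1, f(X(n+1), Z) / f(X(n+1), Y(n)));
    if rand < r
        Y(n+1) = Z;
    else
        Y(n+1) = Y(n);
    end
end

burn = 2000;                                 % discard the burn-in period
C = corrcoef(X(burn+1:end), Y(burn+1:end));
rhoHat = C(1,2);                             % should be near rho
```

Only ratios of f appear, so the normalizing constant of the target is never needed.<br />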
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the linking page's importance is shared among all of its outgoing links, each link counts for less when the page has a large number of them).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero (provided <math>\displaystyle d < 1</math>). We weight the sum and the constant using <math>\displaystyle d </math> (which is just a coefficient between 0 and 1 used to balance the two terms).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we can fix its overall scale. Summing the rank equation over all pages (and assuming every page has at least one outgoing link, so that each column of <math>\displaystyle L \times D^{-1}</math> sums to 1) shows that <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). We can use this to solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, since every column of A sums to 1, we can recognize that P (rescaled so that its entries sum to 1) is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
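<br />
To make the iteration concrete, here is a minimal sketch on a hypothetical 4-page web. It applies the rank equation from the Definitions repeatedly, rather than forming A explicitly. The link matrix, the value d = 0.85 and the number of iterations are assumptions chosen for illustration.<br />

```matlab
% PageRank on a small hypothetical 4-page web (illustrative sketch)
% L(i,j) = 1 if page j links to page i
L = [0 0 1 1;
     1 0 0 0;
     1 1 0 1;
     0 0 0 0];
N = size(L, 1);
d = 0.85;                      % assumed value of the coefficient d
c = sum(L, 1);                 % c(j) = number of outgoing links from page j
M = L * diag(1 ./ c);          % L * D^{-1}; each column sums to 1
e = ones(N, 1);

P = e;                         % initial guess
for k = 1:200
    P = (1 - d)*e + d*M*P;     % apply the rank equation repeatedly
end

residual = norm(P - ((1 - d)*e + d*M*P));   % ~0 once P has stopped changing
[~, order] = sort(P, 'descend');            % pages from most to least important
```

Page 4 has no incoming links, so its rank stays at the constant term 1 - d, while pages 1 and 3, which receive the most links, end up at the top of the ordering.<br />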
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
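<br />
The definitions above can be checked numerically. The following is a small sketch with two arbitrarily chosen vectors in <math>\mathbf{R}^3</math>; the variable names are only illustrative.<br />

```matlab
% Inner product, norm, distance and angle for two example vectors
u = [1; 2; 2];
v = [2; 0; 1];

ip    = u' * v;                  % inner (dot) product: 1*2 + 2*0 + 2*1 = 4
lenU  = sqrt(u' * u);            % length of u, same as norm(u): sqrt(9) = 3
lenV  = norm(v);                 % length of v: sqrt(5)
dUV   = norm(u - v);             % distance between u and v: sqrt(1 + 4 + 1) = sqrt(6)
theta = acos(ip / (lenU*lenV));  % angle between u and v, in radians
```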
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
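<br />
One practical way to test these statements is to place the vectors in the columns of a matrix: the set is linearly independent exactly when the rank of that matrix equals the number of columns. The sketch below uses three small, made-up sets of vectors.<br />

```matlab
% Checking linear (in)dependence through the rank of the matrix of column vectors
V1 = [1 0; 0 1; 1 1];           % two vectors in R^3 (the columns)
V2 = [1 0 1; 0 1 1];            % three vectors in R^2: p = 3 > n = 2
V3 = [1 0; 2 0; 3 0];           % second column is the zero vector

indep1 = (rank(V1) == size(V1, 2));   % true: the set is independent
indep2 = (rank(V2) == size(V2, 2));   % false: more vectors than entries
indep3 = (rank(V3) == size(V3, 2));   % false: the set contains the zero vector
```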
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0,\ \forall x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0,\ \forall x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math> with <math>A^{\dagger}A = I</math>.<br />
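<br />
The following sketch checks two of the facts above on small made-up matrices: that the trace equals the sum of the eigenvalues, and that the formula for the pseudo-inverse gives a left inverse when the matrix has linearly independent columns. The comparison with MATLAB's built-in pinv is included only as a cross-check.<br />

```matlab
% Trace versus eigenvalues, and the pseudo-inverse of a tall matrix
A = [2 1; 1 3];                      % square, symmetric example
trA    = trace(A);                   % 2 + 3 = 5
sumEig = sum(eig(A));                % sum of the eigenvalues, also 5

B = [1 0; 0 1; 1 1];                 % 3x2, not square, linearly independent columns
Bdag     = (B' * B) \ B';            % (B^T B)^{-1} B^T
leftId   = Bdag * B;                 % equals the 2x2 identity
diffPinv = norm(Bdag - pinv(B));     % ~0: agrees with the built-in pinv
```

Note that B * Bdag is a 3x3 matrix of rank 2, so it cannot equal the identity; the pseudo-inverse here is only a left inverse.<br />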
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis of an n-dimensional space, it is necessary and sufficient that the n vectors <math>\{u_i\}</math> be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' if there exists a scalar <math>\lambda</math> (the corresponding eigenvalue) such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite, all eigenvalues are positive.<br />
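<br />
As a small worked example (not from the lecture), consider the symmetric matrix below. Its characteristic equation can be solved by hand:<br />
<br />
<math>A = \left[\begin{matrix} 2 & 1 \\ 1 & 2 \end{matrix}\right], \qquad |A - \lambda I| = (2-\lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3) = 0</math><br />
<br />
so the eigenvalues are <math>\lambda_1 = 3</math> and <math>\lambda_2 = 1</math>. Solving <math>(A - \lambda I)v = 0</math> for each gives the eigenvectors<br />
<br />
<math>v_1 = \left[\begin{matrix} 1 \\ 1 \end{matrix}\right], \qquad v_2 = \left[\begin{matrix} 1 \\ -1 \end{matrix}\right], \qquad v_1^T v_2 = 1 - 1 = 0</math><br />
<br />
This agrees with the properties listed above: A is non-singular (<math>|A| = 3</math>) and no eigenvalue is zero; A is real and symmetric, and the eigenvalues are real with orthogonal eigenvectors; and A is positive-definite, since <math>x^TAx = x_1^2 + x_2^2 + (x_1 + x_2)^2 > 0</math> for <math>x \neq 0</math>, and both eigenvalues are positive.<br />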
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y in <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*This implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
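<br />
These properties are easy to verify for a 2-D rotation matrix, which is a simple example of an orthonormal transformation. The angle and the vector x below are arbitrary choices made for illustration.<br />

```matlab
% An orthonormal transformation preserves vector length (illustrative sketch)
theta = pi/6;                                   % assumed rotation angle
A = [cos(theta) -sin(theta);
     sin(theta)  cos(theta)];                   % 2-D rotation matrix
x = [3; 4];
y = A * x;                                      % transformed vector

orthErr = norm(A*A' - eye(2));                  % ~0, so A^T = A^{-1}
lenX    = norm(x);                              % 5
lenY    = norm(y);                              % also 5: the magnitude is preserved
rowDot  = A(1,:) * A(2,:)';                     % ~0: the rows are orthogonal
rowLen  = norm(A(1,:));                         % 1: each row has unit length
```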
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances along the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
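<br />
The decorrelation property can be illustrated with a short sketch on synthetic data. Here the sample covariance matrix S plays the role of the covariance matrix; the mixing matrix, the sample size and all variable names are assumptions made for this example.<br />

```matlab
% Rotating data onto the eigenvectors of its covariance matrix removes correlation
n  = 2000;
Z  = randn(n, 2);                       % independent standard normal samples
X  = Z * [2 0; 1.5 0.5];                % mix the columns to create correlation
S  = cov(X);                            % 2x2 sample covariance matrix
[V, Lambda] = eig(S);                   % columns of V: principal directions
Xc = X - repmat(mean(X), n, 1);         % centre the data
Y  = Xc * V;                            % coordinates along the eigenvectors
Sy = cov(Y);                            % diagonal up to rounding error
offDiag = abs(Sy(1,2));                 % ~0: transformed variables are uncorrelated
varErr  = norm(diag(Sy) - diag(Lambda));% ~0: the variances equal the eigenvalues
```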
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keep two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the bigger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
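<br />
The result of this example can be checked numerically without Lagrange multipliers. The sketch below (an illustration only) parametrises the constraint circle by an angle t, evaluates f on a fine grid, and confirms that the gradients of f and of the constraint function are parallel at the maximiser. The grid size is an arbitrary choice.<br />

```matlab
% Numerical check of the example: maximise f = x - y on the unit circle
t = linspace(0, 2*pi, 100001);     % parametrise the constraint x^2 + y^2 = 1
x = cos(t);  y = sin(t);
[fMax, k] = max(x - y);            % largest value of f on the grid
xStar = x(k);  yStar = y(k);       % approximately ( sqrt(2)/2, -sqrt(2)/2 )

% grad f = (1, -1) and grad g = (2x, 2y) are parallel when their
% 2-D cross product 1*(2y) - (-1)*(2x) vanishes
parallelErr = abs(2*yStar + 2*xStar);
```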
<br />
====Determining '''W''' ====<br />
Back to the original problem, we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector then the second term of the Lagrangian is 0, so <math>\displaystyle L(\textbf{w},\lambda)</math> equals the variance we want to maximize. <br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian is a unit vector and the constrained maximum is among them.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the 2S'''w''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
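<br />
The derivation above can be checked on synthetic data. In the sketch below (an illustration under assumed names and sizes), the first principal direction is the eigenvector of the sample covariance matrix S with the largest eigenvalue, the variance of the projection onto it equals that eigenvalue, and the eigenvalues sum to the trace of S.<br />

```matlab
% First principal component from the eigen-decomposition of the sample covariance
n = 500;  D = 3;
X = randn(n, D) * [3 0 0; 1 2 0; 0.5 0.5 1];    % rows are observations (synthetic)
S = cov(X);                                      % D x D sample covariance matrix
[W, Lambda] = eig(S);
[lambda, idx] = sort(diag(Lambda), 'descend');   % lambda_1 >= ... >= lambda_D
W = W(:, idx);                                   % column i = i-th principal direction

w1   = W(:, 1);                                  % first principal direction, unit length
var1 = w1' * S * w1;                             % equals lambda(1)
Xc   = X - repmat(mean(X), n, 1);                % centre the data
u1   = Xc * w1;                                  % projections of the data onto w1
totalVar = sum(lambda);                          % equals trace(S), the total variance

v = randn(D, 1);  v = v / norm(v);               % some other unit direction
varOther = v' * S * v;                           % never exceeds lambda(1)
```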
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat') % loads the data matrix X<br />
size(X) % each column of X is one image stored as a vector of 560 = 20*28 pixels<br />
imagesc(reshape(X(:,1),20,28)') % display a noisy image<br />
colormap gray<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare the different types of data) or cluster the data (i.e. leave the data unlabeled and group points that resemble each other).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 x 8 picture, we can use all 64 pixels). However, working in the full space is expensive, and it is usually easier to work with a reduced set of features extracted from the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
The Matlab code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs , scores ] = princomp (X') <br />
% performs principal components analysis on the data matrix X<br />
% returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
[[File:Plot1.jpg]]<br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
[[File:Plot2.jpg]]<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
% the 'ro' command makes them red Os on the plot<br />
[[File:Plot3.jpg]]<br />
 % If we classify based on the position in this plot (feature), <br />
 % it is easier than looking at each of the 64 data pieces<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
[[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y_{d \times n} = U^T X_{D \times n} </math>, where the columns of <math>\, U </math> are the top d eigenvectors. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X}_{D \times n} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
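<br />
The encoding and reconstruction steps in the notes above can be sketched as follows. This is a minimal illustration in Python rather than Matlab, assuming 2-dimensional data reduced to d = 1; the function names <code>top_direction</code> and <code>pca_encode_reconstruct</code> are hypothetical, and the first eigenvector is computed from the closed form for a symmetric 2 x 2 matrix instead of a general eigen-solver.<br />

```python
import math

def top_direction(S):
    """Unit eigenvector of a symmetric 2x2 matrix for its largest eigenvalue."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    lam = (a + d) / 2.0 + math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    if abs(b) > 1e-12:
        v = (b, lam - a)
    else:
        # Diagonal matrix: the eigenvectors are the coordinate axes.
        v = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

def pca_encode_reconstruct(points):
    """Project 2-D points onto their first principal component and map back."""
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    centred = [(p[0] - mean[0], p[1] - mean[1]) for p in points]  # subtract the mean
    s11 = sum(c[0] * c[0] for c in centred) / (n - 1)
    s12 = sum(c[0] * c[1] for c in centred) / (n - 1)
    s22 = sum(c[1] * c[1] for c in centred) / (n - 1)
    u = top_direction([[s11, s12], [s12, s22]])
    codes = [u[0] * c[0] + u[1] * c[1] for c in centred]               # encode: y = u^T x
    recon = [(mean[0] + u[0] * y, mean[1] + u[1] * y) for y in codes]  # decode: u y, mean added back
    return u, codes, recon

# Made-up points lying exactly on the line x2 = 2 x1 + 1.
line_points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
u, codes, recon = pca_encode_reconstruct(line_points)
print(u)
print(recon)
```

Because the made-up points are exactly one-dimensional, one component reconstructs them without loss; with noisy data the reconstruction would discard the variation along the second direction.<br />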
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower dimensional space. The difference is that we are not interested in maximizing variances. Rather our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to, depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapses to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1 </math> and <math>\displaystyle \Sigma_2 </math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar. Therefore,<br />
<br /><br /><br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two-class problem the between-class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a very famous problem which is called "the generalized eigenvalue problem". We can solve this using Lagrange multipliers. Since w is a direction vector, we do not care about the length of w. Therefore we solve a problem similar to that in PCA,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br /><br />
subject to <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br /><br /><br />
We form the following Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br /><br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, our Optimization problem for FDA is:<br />
<br />
<math><br />
\max_{w} \frac{w^T s_B w}{w^T s_w w}<br />
</math><br />
<br />
which we turned into<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />Subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
where <math>s_B</math> is the covariance matrix between classes and <math>s_w</math> is the covariance matrix within classes.<br />
<br />
Using Lagrange multipliers, we look for the stationary points of: <br />
<math>\displaystyle (w^Ts_Bw) - \lambda \cdot [(w^Ts_ww)-1] </math><br />
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified, such that:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
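<br />
The two-class simplification holds because <math>\displaystyle s_Bw = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw</math> is always a scalar multiple of <math>\displaystyle (\mu_1 - \mu_2)</math>, so the eigenvector of <math>\displaystyle s_w^{-1}s_B</math> with a non-zero eigenvalue must point along <math>\displaystyle s_w^{-1}(\mu_1 - \mu_2)</math>. A minimal Python sketch of this computation for the 2-dimensional case is given below; the helper names <code>solve_2x2</code> and <code>fisher_direction</code> are hypothetical, and the numbers are assumed to match the Matlab example that follows (means [1,1] and [5,3], common covariance [1 1.5; 1.5 3]).<br />

```python
def solve_2x2(A, b):
    """Solve the 2x2 linear system A x = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def fisher_direction(sigma1, sigma2, mu1, mu2):
    """Return w = s_w^{-1} (mu2 - mu1), together with s_w and mu2 - mu1."""
    s_w = [[sigma1[i][j] + sigma2[i][j] for j in range(2)] for i in range(2)]
    diff = [mu2[0] - mu1[0], mu2[1] - mu1[1]]
    return solve_2x2(s_w, diff), s_w, diff

sigma = [[1.0, 1.5], [1.5, 3.0]]
w, s_w, diff = fisher_direction(sigma, sigma, [1.0, 1.0], [5.0, 3.0])
print(w)

# Check the eigenvector property s_w^{-1} s_B w = lambda w,
# using s_B w = diff * (diff . w).
lam = diff[0] * w[0] + diff[1] * w[1]
print(solve_2x2(s_w, [diff[0] * lam, diff[1] * lam]), [lam * w[0], lam * w[1]])
```

Only the direction of w matters, so any non-zero rescaling (including a change of sign) gives an equally valid projection.<br />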
<br />
=====Example:=====<br />
<br />
We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods. <br />
 %First of all, we generate the two data sets:<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300) <br />
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300) <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
 %We compute the principal components:<br />
 X=[X1;X2]; % stack the two samples: 600 x 2<br />
 X=X'; % 2 x 600, one column per data point<br />
 [coefs, scores]=princomp(X');<br />
 coefs(:,1) %first principal component<br />
<br />
ans =<br />
0.76355476446932<br />
0.64574307712603<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is no overlapping<br />
Xp=coefs(:,1)'*X<br />
figure<br />
 plot(Xp(1:300),1,'ob')<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case the two classes overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance we could be wishing to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (If Y were instead a continuous variable, the prediction problem would be one of regression rather than classification.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we find the empirical error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
<br />
Error rate is also called misclassification rate, and 1 minus the error rate is sometimes called the classification rate.<br />
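<br />
Counting misclassified points is straightforward to write down. The following is a minimal Python sketch; the names <code>empirical_error_rate</code> and <code>threshold_classifier</code>, as well as the five test points, are made up for illustration.<br />

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) differs from y_i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / len(xs)

def threshold_classifier(x):
    """Hypothetical classifier: predict class 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0

xs = [0.1, 0.4, 0.6, 0.9, 0.45]
ys = [0, 0, 1, 1, 1]
print(empirical_error_rate(threshold_classifier, xs, ys))  # prints 0.2
```

Only the last point is misclassified, so the empirical error rate is 1/5 and the classification rate is 4/5.<br />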
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math>. By Bayes' rule,<br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)Pr(Y=1)} {Pr(X=x)} </math><br />
<br />
where the denominator can be written as <math>\, Pr(X=x) = Pr(X=x \mid Y=1)Pr(Y=1)+Pr(X=x \mid Y=0)Pr(Y=0) </math><br />
<br />
So our classifier function is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
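<br />
When the class conditional densities and the prior are known, the rule can be evaluated directly. The following is a minimal Python sketch under that (usually unrealistic) assumption: the two univariate normal densities, their parameters, and the function names are all hypothetical choices made for illustration.<br />

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_class1(x, prior1, pdf0, pdf1):
    """r(x) = Pr(Y = 1 | X = x), computed with Bayes' rule."""
    numerator = pdf1(x) * prior1
    denominator = numerator + pdf0(x) * (1.0 - prior1)
    return numerator / denominator

def bayes_classifier(x, prior1, pdf0, pdf1):
    """h(x) = 1 if r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior_class1(x, prior1, pdf0, pdf1) >= 0.5 else 0

# Assumed class conditionals: N(0, 1) for class 0 and N(2, 1) for class 1.
def pdf0(x):
    return normal_pdf(x, 0.0, 1.0)

def pdf1(x):
    return normal_pdf(x, 2.0, 1.0)

print(posterior_class1(1.0, 0.5, pdf0, pdf1))
print(bayes_classifier(1.5, 0.5, pdf0, pdf1), bayes_classifier(1.5, 0.1, pdf0, pdf1))
```

With equal priors the point x = 1 is equally likely to belong to either class; lowering the prior of class 1 moves the boundary to the right, so x = 1.5 changes from class 1 to class 0.<br />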
<br />
This function is considered to be the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>. A real-life interpretation of the posterior <math>\, Pr(Y=1 \mid X=x)</math> may even deal with patterns and meaning, which provides an extra challenge in finding a mathematical interpretation.<br />
[[Image:Set Differentiation.jpg|right]]<br />
In the image, which set would it be more appropriate for the question mark to belong to?<br />
<br />
<br />
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.<br />
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, the decision boundary consists of those points at which the two classes are equally probable. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a decision boundary represented by a linear function, while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
<br />
We would like to apply the Bayes classification rule by approximating the class conditional density <br />
<math>\, f_k(x) </math> and the prior <math>\, \pi_k </math> in<br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{j}f_j(x)\pi_j}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional density is multivariate gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>\, (x + a)^TA(x+b) = x^TAx + a^TAb + x^TAb + a^TAx </math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\, \Sigma_k </math> is the class covariance matrix and <math>\, \mu_k </math> is the class mean. By definition of the decision boundary (decision boundary between class <math>\ k </math> and class <math>\ l </math>),<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is a linear function of <math>\ x </math> of the form <math>\, x^Ta + b = 0 </math>.<br />
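<br />
Reading the coefficients off the last line of the derivation gives <math>\, a = \Sigma^{-1}(\mu_k - \mu_l) </math> and <math>\, b = \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} </math>. A minimal Python sketch for the 2-dimensional case follows; the function names are hypothetical and the means and covariance are assumed values (the same ones used in the FDA example earlier).<br />

```python
import math

def inverse_2x2(M):
    """Inverse of a 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_boundary(mu_k, mu_l, sigma, pi_k, pi_l):
    """Coefficients (a, b) of the LDA boundary x^T a + b = 0 between classes k and l."""
    sigma_inv = inverse_2x2(sigma)
    a = mat_vec(sigma_inv, [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]])
    diff = [mu_l[0] - mu_k[0], mu_l[1] - mu_k[1]]
    total = [mu_l[0] + mu_k[0], mu_l[1] + mu_k[1]]
    b = 0.5 * dot(diff, mat_vec(sigma_inv, total)) + math.log(pi_k / pi_l)
    return a, b

def log_odds(x, a, b):
    """x^T a + b: positive on the class-k side, negative on the class-l side."""
    return dot(x, a) + b

a, b = lda_boundary([1.0, 1.0], [5.0, 3.0], [[1.0, 1.5], [1.5, 3.0]], 0.5, 0.5)
print(a, b)
print(log_odds([3.0, 2.0], a, b))  # midpoint of the two means
```

With equal priors the midpoint of the two class means lies on the boundary, so the last printed value should be zero up to rounding.<br />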
<br />
=====Example:=====<br />
<br />
%Load data set<br />
load 2_3;<br />
[coefs, scores] = princomp(X');<br />
size(X)<br />
 % ans = 64 400  % 64 dimensions (pixels), 400 images<br />
size(coefs)<br />
% ans = 64 64 <br />
size(scores)<br />
% ans = 400 64<br />
Y=scores(:, 1:2);<br />
% just use two of the 64 principal components<br />
plot(Y(1:200, 1),Y(1:200, 2), 'b.')<br />
hold on<br />
plot(Y(201:400, 1),Y(201:400, 2), 'r.')<br />
<br />
[[File:Pca_2.jpg]]<br />
<br />
 ll=[zeros(200,1);ones(200,1)]; % 400 x 1 label vector: 0 for the 2s, 1 for the 3s<br />
 [C,err,P,logp,coeff] = classify(Y, Y, ll, 'linear');<br />
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|) </math><br />
<br />
<br />
<br />
To classify a point, <math>\ x </math>, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class <math>\ k </math> if <math>\, \delta_k > \delta_l </math> and vice versa. <br /><br /><br />
<br />
:<math><br />
h(x) = \begin{cases}<br />
k, & \text{if } \delta_k > \delta_l \\<br />
l, & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
(note: since <math> - \frac{d}{2}log(2\pi) </math> is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.) <br /><br /><br />
<br />
'''Special Case: <math>\, \Sigma_k = I </math>''', the identity matrix. Then,<br />
<br />
<math> \,\delta_k = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) - \frac{1}{2}log(|I|) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) </math><br />
<br />
We see that in the case <math>\, \Sigma = I </math>, we can simply classify a point, <math>\ x </math>, to a class based on the distances between <math>\ x </math> and the mean of the different classes (adjusted with the log of the prior). <br /><br /><br />
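<br />
The special case above amounts to a nearest-mean rule with a prior adjustment. The following is a minimal Python sketch; the function names, the two class means, and the priors are hypothetical values chosen only to illustrate the rule.<br />

```python
import math

def delta_identity(x, mu, prior):
    """delta_k = log(pi_k) - 0.5 * ||x - mu_k||^2, valid when Sigma = I."""
    squared_distance = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.log(prior) - 0.5 * squared_distance

def classify_identity(x, means, priors):
    """Return the index of the class with the largest delta."""
    scores = [delta_identity(x, m, p) for m, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

means = [[0.0, 0.0], [4.0, 0.0]]
print(classify_identity([1.5, 0.0], means, [0.5, 0.5]))    # closer to the first mean
print(classify_identity([1.5, 0.0], means, [0.01, 0.99]))  # a strong prior can outweigh distance
```

With equal priors the point is assigned to the nearest mean (index 0); with a prior of 0.99 on the second class the log-prior term dominates and the same point is assigned to index 1.<br />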
<br />
'''General Case: <math>\, \Sigma_k \ne I </math>''' <br /><br /><br />
<br />
<math> \, \Sigma_k = USV^T = USU^T </math> (since <math>\ \Sigma </math> is symmetric)<br /><br /><br />
<math> \, \Sigma_k^{-1} = (USU^T)^{-1} = (U^T)^{-1}S^{-1}U^{-1} = US^{-1}U^T </math><br /><br /><br />
<br />
So, <math> (x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^T US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-1}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k)^T I(S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k) </math><br />
:<math> \, = (x^* - \mu_k^*)^T I(x^* - \mu_k^*) </math> <br /><br /><br />
<br />
where <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> and <math> \mu_k^* = S^{-\frac{1}{2}}U^T\mu_k </math><br />
<br />
Hence the approach taken should be to transform the point <math>x</math> from the beginning,<br />
<br />
i.e. Let <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> <br /><br /><br />
<br />
Then compute <math> \, \delta_k </math> and <math> \, \delta_l </math> with <math>x^*</math>, similar to the special case above. <br /><br /><br />
<br />
If the prior distributions of the 2 classes are the same, then this method only requires us to find the distances from the point x to the mean of the 2 classes. We would classify x based on the shortest distance to the mean.<br />
<br />
In the <math>\delta</math> function calculations above, <math>\pi_k = P(Y = k)</math> and <math>\pi_l = P(Y = l)</math>, and can be approximated using the proportions of <math>k</math> and <math>l</math> elements in the training set.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3668stat341 / CM 3612009-07-29T22:38:15Z<p>Hclam: /* Acceptance/Rejection Method */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. A computer is deterministic, so the numbers it produces only appear to be random (they are pseudo-random), and they must be constructed so that they mimic draws from a given distribution. Outside a computational setting, sampling from a uniform distribution is fairly easy (for example, rolling a fair die repeatedly to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term (when ''b'' = 0 the generator is purely multiplicative). For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
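<br />
Both examples above can be reproduced in a few lines. This is a minimal Python sketch; the function names <code>lcg</code> and <code>cycle_length</code> are hypothetical, and the cycle length is found by simply recording every state until one repeats, which is only practical for small ''m''.<br />

```python
def lcg(a, b, m, seed, count):
    """Return the first `count` terms x_1, x_2, ... of x_{i+1} = (a x_i + b) mod m."""
    values = []
    x = seed
    for _ in range(count):
        x = (a * x + b) % m
        values.append(x)
    return values

def cycle_length(a, b, m, seed):
    """Length of the cycle that the sequence started at `seed` eventually enters."""
    seen = {seed: 0}
    x = seed
    step = 0
    while True:
        x = (a * x + b) % m
        step += 1
        if x in seen:
            return step - seen[x]
        seen[x] = step

print(lcg(13, 0, 31, 1, 3))       # prints [13, 14, 27]
print(cycle_length(13, 0, 31, 1))  # prints 30
print(lcg(3, 2, 4, 1, 3))         # prints [1, 1, 1]
print(cycle_length(3, 2, 4, 1))    # prints 1
```

The first parameter set visits all 30 non-zero residues modulo 31 before repeating, whereas the second is stuck at its seed from the first step.<br />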
<br />
For an ideal situation, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, Park and Miller found in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the largest value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is passed through the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then the random variable X has cdf F, i.e. <math>P(X \leq x) = F(x)</math>.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution with rate <math>\theta</math>).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
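:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1 - U</math> is also uniformly distributed on [0, 1], we can equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />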
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
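<br />
Besides looking at the histogram, the sample can be compared with the known mean <math>1/\theta</math> and cdf <math>1 - e^{-\theta x}</math>. The following is a minimal Python sketch of such a check; the function name <code>sample_exponential</code>, the rate <math>\theta = 2</math>, the seed, and the sample size are assumptions made for illustration.<br />

```python
import math
import random

def sample_exponential(theta, n, rng):
    """Inverse transform sampling: x = -log(1 - u) / theta with u ~ Unif[0, 1)."""
    return [-math.log(1.0 - rng.random()) / theta for _ in range(n)]

rng = random.Random(0)   # fixed seed so that the run is reproducible
theta = 2.0
xs = sample_exponential(theta, 100000, rng)

sample_mean = sum(xs) / len(xs)
fraction_below = sum(1 for x in xs if x <= 0.5) / len(xs)

print(sample_mean, 1.0 / theta)                         # sample mean vs. 1/theta
print(fraction_below, 1.0 - math.exp(-theta * 0.5))     # empirical vs. exact F(0.5)
```

Python's <code>random()</code> returns values in [0, 1), so <code>1 - u</code> is never zero and the logarithm is always defined.<br />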
<br />
<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math> and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math>. Further, for some distributions it is not even possible to find <math>F(x)</math> (i.e. a closed form expression for the distribution function, or otherwise; even if the closed form expression exists, it's usually difficult to find <math>F^{-1}</math>).<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (the boundaries are written with less-than-or-equal-to signs so that the case <math>U = 1</math> is not missed; since <math>U</math> is continuous, each boundary value has probability zero, so this choice does not change the distribution):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
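 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000);            % preallocate the sample vector<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2))   % p(1:2) selects the first two probabilities<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />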
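<br />
The same idea extends to any finite discrete distribution by comparing <math>U</math> with the running totals of the probabilities. The following is a minimal sketch rather than code from the lecture; the variable names vals, cdf and freq are our own choices:<br />
<br />
```matlab
vals = [0 1 2];            % support x_0, x_1, x_2 of the example
p    = [0.3 0.2 0.5];      % probabilities p_0, p_1, p_2
cdf  = cumsum(p);          % running totals p_0, p_0+p_1, ...
cdf(end) = 1;              % guard against round-off in the last total
n = 10000;
x = zeros(n,1);
for i = 1:n
    u = rand;
    k = find(u <= cdf, 1, 'first');   % first interval that contains u
    x(i) = vals(k);
end
freq = [mean(x==0) mean(x==1) mean(x==2)]   % should be close to p
```
<br />
Only vals and p need to change to sample from a different finite distribution.<br />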
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
[[Image:edit.JPG|thumb|left|500px|Ali: Some statements are incorrect, inaccurate or misleading. Acceptance-Rejection Method needs to be motivated in more details. ]]<br /><br /><br /><br /><br /><br /><br />
<br />
<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult to sample from directly. <br /><br /><br />
<br />
We can apply the Acceptance-Rejection Method by considering a proposal distribution, <math>g(x)</math>, that is easy to sample from, with the condition <br /><br /><br />
<br />
<math> f(x) \leq c \cdot g(x) \quad \forall x, \text{ for some constant } c \in \Re^+</math><br />
<br /><br /><br />
[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Then we draw samples Y from <math> g(x)</math> and accept each one with probability<br />
<br /><br /><br />
<math> \frac {f(Y)}{c \cdot g(Y)} </math>.<br />
<br /><br /><br />
We are more likely to reject a sample Y when the ratio above is much smaller than 1, which happens when <math>f(Y)</math> lies far below <math>cg(Y)</math>. Conversely, we accept the sample with probability 1 when <math>f(Y) = cg(Y)</math>. This process of accepting and rejecting points yields a sample that follows the target distribution <math>f(x)</math>.<br />
<br />
<br />
<br />
In the graph,<br /><br />
Sampling y = 7 from <math>g(x)</math> gives a point where <math>f(y)</math> and <math>cg(y)</math> coincide, so it will be accepted with probability 1.<br />
<br />
Sampling y = 9 from <math>g(x)</math> gives a point where <math>f(y)</math> is far below <math>cg(y)</math>, so it will be accepted with a low probability.<br />
<br />
'''Proof'''<br />
[[Image:edit.JPG|thumb|left|500px|Ali: Proof of what? . ]]<br /><br /><br /><br /><br /><br /><br />
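Claim: if points are generated by the Acceptance/Rejection method, then the accepted points follow the target distribution <math>f(x)</math>. Here X denotes a draw from the proposal distribution and, with a slight abuse of notation, <math>P(X=x)</math> denotes its density.<br /><br /><br />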
<br />
<math> P(X=x|accept) = \frac{P(accept|X=x)P(X=x)}{P(accept)}</math><br /> <br />
by Bayes' theorem<br />
<br /><br /><br />
<math>\begin{align} &P(accept|X=x) = \frac{f(x)}{c \cdot g(x)}\\ &P(X=x) = g(x) \end{align}</math><br /><br />by hypothesis.<br /><br /><br />
<br />
Then,<br /><br />
<math>\begin{align} P(accept) &= \int^{}_x P(accept|X=x)P(X=x) dx \\<br />
&= \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx\\<br />
&= \frac{1}{c} \int^{}_x f(x) dx\\<br />
&= \frac{1}{c} \end{align} </math><br />
<br /><br /><br />
Therefore,<br /><br />
<math> P(X=x|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)}g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
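<br />
A short supplementary consequence of this calculation (not part of the lecture notes): each proposal is accepted independently with probability <math>1/c</math>, so the number of proposals <math>T</math> needed to obtain one accepted sample follows a Geometric distribution:<br />
:<math>P(T = k) = \left(1-\frac{1}{c}\right)^{k-1}\frac{1}{c}, \quad k = 1, 2, \dots, \qquad E[T] = c</math><br />
For instance, with <math>c = 2</math> half of the proposals are rejected on average, which is why <math>c</math> should be chosen as small as the condition <math>f(x) \leq c \cdot g(x)</math> allows.<br />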
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max_x \left(\frac{f(x)}{g(x)}\right) = 2</math> here. Any <math>c</math> at least this large satisfies <math>f(x) \leq c \cdot g(x)</math>, and choosing the smallest such value maximizes the acceptance probability <math>1/c</math>.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
 j = 0;                 % number of accepted samples so far<br />
 x = zeros(1,1000);     % preallocate the sample vector<br />
 while j < 1000<br />
   y = rand;<br />
   u = rand;<br />
   if u <= y<br />
     j = j + 1;<br />
     x(j) = y;<br />
   end<br />
 end<br />
 hist(x);<br />
<br />
<br />
The histogram produced here should approximate the target density, <math>f(x) = 2x</math>, on the interval <math> 0 \leq x \leq 1 </math>, which is the pdf of the Beta(2,1) distribution.<br />
<br />
[[File:BetaDistn.jpg]]<br />
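<br />
The Beta(2,1) example can also be written in vectorized form, which makes it easy to check the acceptance rate against <math>1/c</math>. This is a sketch under the same choices as above (<math>g</math> uniform on [0,1], <math>c = 2</math>); the names f, g, accepted and acc_rate are ours:<br />
<br />
```matlab
f = @(x) 2*x;                 % target density, Beta(2,1)
g = @(x) ones(size(x));       % proposal density, Unif[0,1]
c = 2;                        % bound with f(x) <= c*g(x) on [0,1]
N = 20000;                    % number of proposals
y = rand(N,1);                % draws from the proposal
u = rand(N,1);
accepted = u <= f(y) ./ (c * g(y));
x = y(accepted);              % accepted draws follow the target
acc_rate = mean(accepted)     % should be close to 1/c = 0.5
sample_mean = mean(x)         % Beta(2,1) has mean 2/3
```
<br />
Note that the number of accepted samples is random here, whereas the loop version above stops after exactly 1000 acceptances.<br />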
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can take <math>g(x)</math> to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
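<br />
The steps of this discrete example can be sketched in Matlab as follows. This is an illustration, not code from the lecture; the counters j and proposals are our own additions, used to compare the observed acceptance rate with <math>1/c</math>:<br />
<br />
```matlab
f = [0.15 0.55 0.20 0.10];   % target pmf on {1,2,3,4}
g = 0.25;                    % discrete uniform proposal pmf
c = max(f) / g;              % equals 2.2
n = 10000;
x = zeros(n,1);
j = 0;                       % accepted samples so far
proposals = 0;               % total proposals drawn
while j < n
    y = randi(4);            % Y ~ Unif{1,2,3,4}
    u = rand;
    proposals = proposals + 1;
    if u <= f(y) / (c * g)
        j = j + 1;
        x(j) = y;
    end
end
freq = [mean(x==1) mean(x==2) mean(x==3) mean(x==4)]   % compare with f
acc_rate = n / proposals                               % compare with 1/c
```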
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, <math>c</math> must be large and we may have to reject a large number of points before a sample of the desired size is obtained, since each proposal is accepted with probability <math>1/c</math>. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
This suggests that to sample from the Gamma distribution with integer shape <math>t</math>, we can sample t independent exponential random variables with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method to turn independent uniform samples into independent exponential samples, and then add these up to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
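<br />
For a general rate <math>\lambda</math> the product form from step 3 of the procedure can be used directly. The sketch below assumes <math>\lambda = 2</math> purely for illustration and compares the sample moments with the theoretical mean <math>t/\lambda</math> and variance <math>t/\lambda^2</math> of the Gamma distribution:<br />
<br />
```matlab
t = 5;                           % shape (number of exponentials summed)
lambda = 2;                      % rate; illustrative value
n = 10000;
U = rand(n, t);
X = -log(prod(U, 2)) / lambda;   % X = -(1/lambda) * log(u_1 * ... * u_t)
sample_mean = mean(X)            % theory: t/lambda = 2.5
sample_var  = var(X)             % theory: t/lambda^2 = 1.25
```
<br />
For large <math>t</math> the product of uniforms can underflow, in which case summing the logarithms, as in the code above, is the safer choice.<br />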
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far; for example, the inverse of F(Z) has no closed form. We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is the distance from that point to the pole (origin), and θ is the angle formed with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}} \cdot \frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, an Exponential density in d and a Uniform density in <math>\theta</math>, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \quad \theta \sim~ Unif[0,2\pi] \end{matrix} </math> (rate 1/2, that is, mean 2)<br />
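<br />
The joint density of <math>d = r^2</math> and <math>\theta</math> follows from a change of variables. A brief sketch of the calculation, added here for completeness, uses the substitution <math>x = \sqrt{d}\cos\theta</math>, <math>y = \sqrt{d}\sin\theta</math>:<br />
:<math>\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| = \left|\det\begin{bmatrix} \frac{\cos\theta}{2\sqrt{d}} & -\sqrt{d}\sin\theta \\ \frac{\sin\theta}{2\sqrt{d}} & \sqrt{d}\cos\theta \end{bmatrix}\right| = \frac{1}{2}\cos^2\theta + \frac{1}{2}\sin^2\theta = \frac{1}{2}</math><br />
:<math>f(d,\theta) = f(x,y)\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| = \frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \cdot \frac{1}{2} = \frac{1}{2}e^{-\frac{d}{2}} \cdot \frac{1}{2\pi}</math><br />
The first factor is the density of an Exponential random variable with rate 1/2 and the second is the density of a Uniform random variable on <math>[0, 2\pi]</math>.<br />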
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
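<br />
Besides looking at histograms, the output of the Box-Muller procedure can be checked numerically. The sketch below is our own addition: it verifies that each coordinate has mean near 0 and variance near 1, and that the two coordinates are nearly uncorrelated:<br />
<br />
```matlab
n = 100000;
u1 = rand(n,1);
u2 = rand(n,1);
r = sqrt(-2*log(u1));          % r = sqrt(d), with d ~ Exp(1/2)
theta = 2*pi*u2;               % theta ~ Unif[0, 2*pi]
x = r.*cos(theta);
y = r.*sin(theta);
moments = [mean(x) var(x); mean(y) var(y)]   % each row should be near [0 1]
C = corrcoef(x, y);
rho = C(1,2)                                  % should be near 0
```
<br />
A correlation near zero is consistent with independence but does not prove it; independence follows from the factorization of the joint density.<br />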
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independent} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (stored as u in the code below) and covariance matrix <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math>:<br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate a square root of <math>\Sigma</math>, that is, a matrix ss with <math>\text{ss}^T\,\text{ss} = \Sigma</math>, using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
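<br />
To see that both square roots give the same distribution, one can compare the sample covariance of the generated points with <math>\Sigma</math>. This is a minimal sketch of our own; the names R, S, err_chol and err_sqrtm are not from the lecture:<br />
<br />
```matlab
mu = [-2; 3];
sigma = [1 0.5; 0.5 1];
n = 50000;
Z = randn(n, 2);                       % rows are independent N(0, I) vectors
R = chol(sigma);                       % upper triangular, R'*R = sigma
S = sqrtm(sigma);                      % symmetric, S*S = sigma
X_chol  = repmat(mu', n, 1) + Z*R;     % rows have covariance R'*R
X_sqrtm = repmat(mu', n, 1) + Z*S;     % rows have covariance S'*S
err_chol  = max(max(abs(cov(X_chol)  - sigma)))   % should be small
err_sqrtm = max(max(abs(cov(X_sqrtm) - sigma)))   % should be small
```
<br />
Because the samples are stored as rows and multiplied on the right, what matters is that the transpose of the factor times the factor equals <math>\Sigma</math>, which holds for both choices.<br />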
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
In this lecture we will continue to discuss sampling from specific distributions, introduce '''Monte Carlo Integration''', and also talk about the differences between the Bayesian and Frequentist views on probability, along with references to '''Bayesian Inference'''.<br />
<br />
====Binomial Distribution====<br />
A Binomial random variable <math>X \sim~ Bin(n,p) </math> is the number of successes in <math>n</math> independent Bernoulli trials, each with probability of success <math>p</math> <math>(0 \leq p \leq 1)</math>. For each trial we generate an independent uniform random variable: <math>U_1, \ldots, U_n \sim~ Unif(0,1)</math>. Then X is the number of times that <math>U_i \leq p</math>. If n is large enough, by the central limit theorem, the Normal distribution can be used to approximate the Binomial distribution.<br />
<br />
Sampling from Binomial distribution in Matlab is done using the following code:<br />
n=3;<br />
p=0.5;<br />
trials=1000;<br />
X=sum((rand(trials,n))'<=p);<br />
hist(X)<br />
<br />
The histogram approximates the Bin(n, p) distribution; for larger <math>n</math>, it would resemble a Normal distribution.<br />
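<br />
As a quick numerical check, the sample mean and variance can be compared with the theoretical values <math>np</math> and <math>np(1-p)</math>. The sketch below uses illustrative values n = 10 and p = 0.3 that are not from the lecture:<br />
<br />
```matlab
n = 10; p = 0.3; trials = 10000;     % illustrative values
X = sum(rand(trials, n) <= p, 2);    % one Binomial draw per row
sample_mean = mean(X)                % theory: n*p = 3
sample_var  = var(X)                 % theory: n*p*(1-p) = 2.1
```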
<br />
====Monte Carlo Integration====<br />
<br />
Monte Carlo Integration is a numerical method of approximating the evaluation of integrals using random numbers generated from simulations. In this course we will mainly look at three methods for approximating integrals:<br />
# Basic Monte Carlo Integration<br />
# Importance Sampling<br />
# Markov Chain Monte Carlo (MCMC)<br />
<br />
====Bayesian VS Frequentists====<br />
<br />
Over the history of statistics, two major schools of thought have emerged and have been locked in an ongoing struggle over which one has the correct view of probability. These two schools are known as the Bayesian and Frequentist schools of thought. The Bayesians and the Frequentists hold different philosophical views on what defines probability. Below are some fundamental differences between the two schools:<br />
<br />
'''Frequentist'''<br />
*Probability is '''objective''' and refers to the limit of an event's relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads 50% of the time in the long run.<br />
*Parameters are all fixed and unknown constants.<br />
*Any statistical procedure is interpreted only through limiting frequencies. For example, a 95% C.I. procedure for a given parameter will produce intervals that contain the true value of the parameter 95% of the time under repeated sampling.<br />
<br />
'''Bayesian'''<br />
*Probability is '''subjective''' and can be applied to single events based on degree of confidence or belief. For example, a Bayesian can refer to tomorrow's weather as having a 50% chance of rain, whereas this would not make sense to a Frequentist because tomorrow is just one unique event, and cannot be referred to as a relative frequency in a large number of trials.<br />
*Parameters are random variables that have a given distribution, and probability statements can be made about them.<br />
*Inference yields a probability distribution over the parameters, and point estimates are usually obtained by taking either the mode or the mean of that distribution. <br />
<br />
====Bayesian Inference====<br />
<br />
'''Example''':<br />
If we have a screen that only displays single digits from 0 to 9, and this screen is split into a 4x5 matrix of pixels, then all together the 20 pixels that make up the screen can be referred to as <math>\vec{X}</math>, which is our data, and the parameter of the data for this case, which we will refer to as <math> \theta </math>, would be a discrete random variable that can take on the values of 0 to 9. In this example, a Bayesian would be interested in finding <math> Pr(\theta=a|\vec{X}=\vec{x})</math>, whereas a Frequentist would be more interested in finding <math> Pr(\vec{X}=\vec{x}|\theta=a)</math><br />
<br />
=====Bayes' Rule=====<br />
<br />
:<math>f(\theta|X) = \frac{f(X | \theta)\, f(\theta)}{f(X)}.</math><br />
<br />
Note: In this case <math>f (\theta|X)</math> is referred to as '''posterior''', <math>f (X | \theta)</math> as '''likelihood''', <math>f (\theta)</math> as '''prior''', and <math>f (X)</math> as the '''marginal''', where <math>\theta</math> is the parameter and <math>X</math> is the observed variable.<br />
<br />
'''Procedure in Bayesian Inference'''<br />
*First choose a probability distribution as the prior, which represents our beliefs about the parameters.<br />
*Then choose a probability distribution for the likelihood, which represents our beliefs about the data.<br />
*Lastly compute the posterior, which represents an update of our beliefs about the parameters after having observed the data.<br />
<br />
As mentioned before, for a Bayesian, finding point estimates usually involves finding the mode or the mean of the parameter's distribution.<br />
<br />
'''Methods'''<br />
*Mode: <math>\hat\theta = \arg\max_{\theta} f(\theta|X)</math>, the value of <math>\theta</math> that maximizes <math>f(\theta|X)</math><br />
*Mean: <math> \bar\theta = \int^{}_\theta \theta \cdot f(\theta|X)d\theta</math><br />
<br />
If it is the case that <math>\theta</math> is high-dimensional, and we are only interested in one of the components of <math>\theta</math>, for example, we want <math>\theta_1</math> from <math> \vec{\theta}=(\theta_1,\dots,\theta_n)</math>, then we would have to calculate the integral: <math>\int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_n </math><br />
<br />
This sort of calculation is usually very difficult or not feasible to compute, and thus we would need to do it by simulation.<br />
<br />
'''Note''': <br />
#<math>f(X)=\int^{}_\theta f(X | \theta)f(\theta) d\theta</math> is not a function of <math>\theta</math>, and is called the '''normalizing constant''' (or normalization factor)<br />
#Therefore, since <math>f(X)</math> is constant in <math>\theta</math>, the posterior is proportional to the likelihood times the prior: <math>f(\theta|X)\propto f(X | \theta)f(\theta)</math><br />
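<br />
Because the posterior is proportional to the likelihood times the prior, for a one-dimensional parameter it can be approximated on a grid and normalized afterwards. The sketch below assumes, purely for illustration, a Binomial likelihood with x = 7 successes in n = 10 trials and a flat prior on <math>\theta</math>; these numbers are hypothetical and not from the lecture:<br />
<br />
```matlab
n = 10; x = 7;                                  % hypothetical data
theta = linspace(0, 1, 1001);                   % grid over the parameter
prior = ones(size(theta));                      % flat prior f(theta)
lik   = theta.^x .* (1 - theta).^(n - x);       % likelihood, up to a constant
unnorm = lik .* prior;                          % proportional to the posterior
post  = unnorm / trapz(theta, unnorm);          % divide by the normalizing constant
[pmax, imax] = max(post);
post_mode = theta(imax)                         % grid point maximizing the posterior
post_mean = trapz(theta, theta .* post)         % posterior mean
```
<br />
A grid like this becomes impractical when <math>\theta</math> is high-dimensional, which is the situation where simulation is needed.<br />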
<br />
===[[Monte Carlo Integration]] - May 26, 2009===<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right]</math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming <math>\sigma</math> is known, so that <math>\theta = \mu</math>, and the prior on <math>\mu</math> is Gaussian with mean <math>\mu_0</math> and standard deviation <math>\tau</math>:<br />
:<math>f(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \left[\prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}\right] \cdot \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math> and <math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean <math>\tilde{\mu}</math> is a weighted average of the sample mean <math>\bar{x}</math> (the MLE) and the prior mean <math>\mu_0</math>, so in general it differs from the MLE.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
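<br />
The weighting of <math>\bar{x}</math> and <math>\mu_0</math> can be illustrated numerically. In the sketch below all numbers (mu_true, sigma, mu0, tau and N) are hypothetical values chosen for illustration, and the posterior standard deviation uses the standard conjugate-normal result <math>\tilde{\sigma}^2 = (N/\sigma^2 + 1/\tau^2)^{-1}</math>:<br />
<br />
```matlab
mu_true = 1; sigma = 2;          % data model N(mu_true, sigma^2), sigma known
mu0 = 0; tau = 1;                % prior N(mu0, tau^2)
N = 20;
x = mu_true + sigma*randn(N,1);  % simulated observations
xbar = mean(x);                  % the MLE of mu
w = (N/sigma^2) / (N/sigma^2 + 1/tau^2);   % weight on the sample mean
mu_post = w*xbar + (1 - w)*mu0   % posterior mean, lies between mu0 and xbar
sigma_post = sqrt(1 / (N/sigma^2 + 1/tau^2))   % posterior standard deviation
```
<br />
Increasing N moves the weight w towards 1, so the posterior mean approaches the sample mean, in line with the limit noted above.<br />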
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to simulating a random process, that is, a process whose future evolution is described by probability distributions (the counterpart of a deterministic process); such simulation methods are known as ''Monte Carlo methods'' [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a}</math>, the p.d.f. of Unif<math>(a,b)</math>, so that <math>I = E_f[w(x)]</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> independently, then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim Unif(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
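<br />
The process above, together with the standard error and confidence interval, can be sketched as follows. The integrand <math>h(x) = \sin(x)</math> on <math>[0, \pi]</math> is our own illustrative choice (its exact integral is 2), and 1.96 is used as the normal quantile for a 95% interval:<br />
<br />
```matlab
h = @(x) sin(x);               % integrand (illustrative choice)
a = 0; b = pi;                 % limits of integration
N = 10000;
x = a + (b - a)*rand(N,1);     % x_i ~ Unif(a,b)
Y = h(x)*(b - a);              % Y_i = w(x_i)
I_hat = mean(Y)                % Monte Carlo estimate of I
SE = std(Y)/sqrt(N)            % std uses the N-1 denominator
CI = [I_hat - 1.96*SE, I_hat + 1.96*SE]   % approximate 95% interval
```
<br />
The standard error shrinks like <math>1/\sqrt{N}</math>, so halving it requires four times as many samples.<br />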
<br />
'''Example 1'''<br />
<br />
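Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}, \ x \geq 0\end{matrix} </math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, which can be done with the Inverse Transform Method: <math>x_i = -\log(u_i)</math> with <math>u_i \sim~ Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />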
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
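 u=rand(100,1);<br />
 x=-log(u);       % inverse transform: x ~ Exp(1)<br />
 h=x.^.5;         % h(x) = sqrt(x), element-wise<br />
 mean(h)<br />
 %The value obtained using the Monte Carlo method<br />
 F = @(x) sqrt(x).*exp(-x);<br />
 quad(F,0,50)<br />
 %The value obtained by numerical integration in Matlab (the exact value is sqrt(pi)/2, about 0.8862)<br />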
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# Exact value: <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3 \cdot (1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
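 x = rand(1000,1);   % a column of 1000 uniform draws; rand(1000) would give a 1000x1000 matrix<br />
 mean(x.^3)          % element-wise cube; x^3 would be a matrix power<br />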
<br />
'''Example 3'''<br />
To estimate an integral over an infinite range,<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute in <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math> and the limits <math>x=5, x=\infty</math> become <math>y=1, y=0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\left(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2}\right)</math><br />
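<br />
This example can be sketched in Matlab as follows. The code is our own illustration; it uses the substitution <math>y = \frac{1}{x-4}</math>, under which <math>I = \int_0^1 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy</math>, and compares the estimate with the exact value <math>3e^{-5}</math>:<br />
<br />
```matlab
N = 100000;
y = rand(N,1);                        % Y_i ~ Unif(0,1)
w = 3*exp(-(1./y + 4)) ./ y.^2;       % transformed integrand w(y_i)
I_hat = mean(w)                       % Monte Carlo estimate
SE = std(w)/sqrt(N)                   % standard error of the estimate
I_exact = 3*exp(-5)                   % exact value of the integral
```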
<br />
===[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009===<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
By Bayes' rule, <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2) f(p_1,p_2)}{f(x,y)}</math>, where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be flat (uniform), and the posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find the posterior distributions of <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of the success probability p of a Binomial is proportional to the likelihood, <math>\displaystyle f(p|x) \propto p^x(1-p)^{n-x}</math>, which is the kernel of a Beta density. Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where the Beta density is<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> for each sample in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
 mean(delta)   % no semicolon, so the value is displayed<br />
<br />
The mean in this example is approximately 0.1938 (the exact value varies from run to run).<br />
<br />
A 95% posterior (credible) interval for <math>\delta</math> is given by the interval between the 2.5% and 97.5% quantiles of the sampled values, which covers 95% of the posterior distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The 95% interval is approximately <math>(0.06606, 0.32204)</math>.<br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the lowest dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
The procedure is:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> independently for <math>\displaystyle i=1,...,10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>; otherwise discard it and draw again.<br />
# Compute <math>\displaystyle \delta = x_j</math>, where <math>\displaystyle j</math> is the smallest index with <math>\displaystyle p_j \geq 0.5</math>.<br />
# Repeat the process until N samples have been accepted, giving N values <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# Estimate <math>\displaystyle \delta</math> by <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{i=1}^{N} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>X_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and it is observed that the number that die is <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
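<br />
The procedure above can be sketched in Matlab as follows. This is only an illustrative sketch: the death counts in y and the dose values in x are hypothetical placeholders (the full data set is not listed here), and samples in which no <math>p_i</math> reaches 0.5 are discarded, which is an added assumption.<br />
<br />
 n=10; %number of rats per dose<br />
 y=[1 1 2 3 5 6 7 8 9 10]; %hypothetical death counts, for illustration only<br />
 x=1:10; %hypothetical increasing dose values<br />
 N=1000; %number of accepted samples<br />
 delta=zeros(N,1);<br />
 k=0;<br />
 while k < N<br />
     p=betarnd(y+1, n-y+1); %one draw of (p_1,...,p_10)<br />
     if all(diff(p)>=0) && any(p>=0.5) %keep only ordered samples<br />
         k=k+1;<br />
         delta(k)=x(find(p>=0.5, 1)); %first dose with p_j>=0.5<br />
     end<br />
 end<br />
 mean(delta) %Monte Carlo estimate of delta<br />
<br />
Because most draws violate the ordering constraint, the loop usually needs many more than N draws to collect N accepted samples.<br />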
<br />
====Importance Sampling====<br />
<br />
In statistics, importance sampling is a technique for estimating the properties of a particular distribution using samples generated from a different distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the given random variables. The main idea in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target function. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as the expectation with respect to U, such that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>, where the <math>u_i</math> are drawn from U. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x) and therefore <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Draw <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math>.<br />
# Compute <math>\displaystyle w(x_i)=\frac{h(x_i)f(x_i)}{g(x_i)}</math> for each sample.<br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
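<br />
The process above can be sketched in Matlab as follows. The specific choices are assumptions made only for illustration: f is the standard normal density, <math>h(x)=x^2</math> (so the true value is <math>I=1</math>), and g is the <math>N(0,4)</math> density.<br />
<br />
 N=100000; %number of samples<br />
 x=2*randn(N,1); %draws from g, the N(0,4) density<br />
 f=exp(-x.^2/2)/sqrt(2*pi); %target density f at the draws<br />
 g=exp(-x.^2/8)/(2*sqrt(2*pi)); %proposal density g at the draws<br />
 w=(x.^2).*f./g; %w(x)=h(x)f(x)/g(x)<br />
 Ihat=mean(w) %importance sampling estimate of I<br />
<br />
Here sampling from f directly would also be easy; the example is only meant to show the mechanics of the weights.<br />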
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> that are closer to the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math>, which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, that is, sampling from g, and then weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that the values for which this ratio is high count more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of the <math> h(x_i) </math> instead of a plain sum.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Where <math>\displaystyle f > g</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x) </math> does not vanish as quickly, then the integrand <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can become arbitrarily large and <math>\displaystyle E[(w(x))^2] </math> may be infinite. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
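<br />
The effect of the tails can be illustrated with a small Matlab experiment. The setup is an assumption chosen for illustration: f is the standard normal density and g is the <math>N(0,s^2)</math> density, once with <math>s=0.5</math> (thinner tail than f) and once with <math>s=2</math> (thicker tail than f).<br />
<br />
 N=10000; %number of samples per proposal<br />
 for s=[0.5 2] %standard deviation of the proposal g<br />
     x=s*randn(N,1); %draws from g, the N(0,s^2) density<br />
     b=s*exp(-x.^2/2+x.^2/(2*s^2)); %weights f(x)/g(x) at the draws<br />
     [s max(b) var(b)] %proposal sd, largest weight, sample variance of the weights<br />
 end<br />
<br />
For <math>s=2</math> the weights are bounded by 2, while for <math>s=0.5</math> they are unbounded and a few draws in the tail can dominate the estimate.<br />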
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
'''Aside: Jensen's Inequality'''<br />
<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>; here <math>\displaystyle g </math> denotes a generic convex function, not the proposal density) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math> for all <math>\displaystyle \alpha \in [0,1]</math><br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies on or above the curve. This can then be generalized to any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
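Thus, <math>\displaystyle \left(\int \left| h(x) \right| f(x) dx \right)^2</math> is a lower bound on <math>\displaystyle E[(w(x))^2]</math> that does not depend on <math>\displaystyle g</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E_g[(w(x))^2]</math>, we can see that this lower bound is attained, as we require. Details omitted.<br /><br />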
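<br />
The omitted step can be sketched as follows, writing <math>\displaystyle c=\int \left| h(s) \right| f(s) ds</math> (a shorthand introduced here) and assuming <math>\displaystyle c>0</math>. With <math>\displaystyle g=g^*</math> we have <math>\displaystyle w(x)=\frac{h(x)f(x)}{g^*(x)}=c\,\frac{h(x)}{\left| h(x) \right|}</math> wherever <math>\displaystyle g^*(x)>0</math>, so <math>\displaystyle (w(x))^2=c^2</math> there and<br />
::<math>\displaystyle E_{g^*}[(w(x))^2] = \int c^2 g^*(x) dx = c^2 = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math>,<br />
which is exactly the lower bound on the second moment.<br />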
<br />
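However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is an integral of the same kind as the one we are trying to estimate.<br />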
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
:: <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
Since f is a density, <math>\displaystyle \int f(x)\,dx = 1</math>, so we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{k=1}^{N} b(x_k)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{k=1}^{N} b(x_k)}</math><br />
This is very important and useful, especially when f is known only up to a proportionality constant, because the unknown constant cancels in <math>\displaystyle b'</math>. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
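<br />
A minimal Matlab sketch of this normalized form is given below. The setup is assumed for illustration only: f is the standard normal density known only up to its normalizing constant, <math>h(x)=x^2</math> (so the true value is 1), and g is the <math>N(0,4)</math> density.<br />
<br />
 N=100000; %number of samples<br />
 x=2*randn(N,1); %draws from g, the N(0,4) density<br />
 ftilde=exp(-x.^2/2); %f up to a constant (normalizing factor left out on purpose)<br />
 g=exp(-x.^2/8)/(2*sqrt(2*pi)); %proposal density g at the draws<br />
 b=ftilde./g; %unnormalized weights<br />
 bprime=b/sum(b); %normalized weights, which sum to 1<br />
 Ihat=sum(bprime.*x.^2) %estimate of the expectation of X^2 under f<br />
<br />
Multiplying ftilde by any positive constant leaves Ihat unchanged, which is why the normalizing constant of f is not needed.<br />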
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_1,...,x_N \sim~ N(0,1) </math><br />
The idea here is to sample from the normal distribution and to count the number of observations that are greater than 3.<br />
<br />
The variability of the plain Monte Carlo estimate will be high in this case, since this is a low-probability event (a tail event) and different runs may give significantly different values. For example, with 1000 samples the first run may give only 3 occurrences (an estimate of .003), the second run may give 5 occurrences (an estimate of .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (100 runs, each generating 100 samples):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 \times 10^{-5} </math>.<br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x)</math> be the density of <math>N(4,1) </math>.<br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and is much less than the variability observed when using Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set from which t takes its values. It could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math>, then<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which <math>\displaystyle X_t</math>, given the past, depends only on <math>\displaystyle X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\ddots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
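<br />
As a small illustration, the transition matrix of such a random walk can be built in Matlab as follows. The number of states N=5 and the value p=0.3 are arbitrary choices made for this sketch.<br />
<br />
 N=5; %number of states (illustrative choice)<br />
 p=0.3; q=1-p; %probabilities of stepping right and left<br />
 P=zeros(N);<br />
 P(1,1)=1; P(N,N)=1; %the first and last states are absorbing<br />
 for i=2:N-1<br />
     P(i,i+1)=p; %step to the right<br />
     P(i,i-1)=q; %step to the left<br />
 end<br />
 P<br />
 sum(P,2) %every row sums to 1<br />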
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of values <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a variation of Monte Carlo estimation can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1, as the row must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
A Markov chain is homogeneous if the transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_N)]</math> <br /><br />
where <math>x_1,...,x_N</math> is an ordered list of all possible states of <math>X</math>,<br /><br />
then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
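<br />
The theorem can be checked numerically in Matlab. The three-state transition matrix and the initial distribution below are hypothetical and chosen only for illustration.<br />
<br />
 P=[0.5 0.5 0; 0.25 0.5 0.25; 0 0.5 0.5]; %hypothetical transition matrix<br />
 mu0=[1 0 0]; %the chain starts in state 1<br />
 n=5; %number of steps<br />
 mu=mu0;<br />
 for k=1:n<br />
     mu=mu*P; %one step: mu_k = mu_(k-1)*P<br />
 end<br />
 mu %marginal distribution after n steps<br />
 mu0*P^n %the same vector, computed from the n-step matrix<br />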
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
----<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only move to each other or stay where they are (rows 1 and 2 have nonzero entries only in columns 1 and 2), so they communicate and form a closed class<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state, but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
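<br />
The decomposition can also be found mechanically. The following Matlab sketch, which is one possible approach and not part of the lecture, marks which states reach which and then groups states that reach each other.<br />
<br />
 P=[1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];<br />
 N=size(P,1);<br />
 R=(eye(N)+(P>0))^N > 0; %R(i,j) is true if i reaches j in 0 to N steps<br />
 C=R & R'; %C(i,j) is true if i and j communicate<br />
 classes=unique(double(C),'rows') %one distinct row per class<br />
<br />
Each row of the result is the indicator vector of one class; for this matrix there are three distinct rows, matching the three classes listed above.<br />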
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i</math> for some <math>\displaystyle n\geq 1 | x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the other will be transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. If we move n steps to the right, the only way to be back at <math>\displaystyle 0</math> is to also move n steps to the left, so a return to <math>\displaystyle 0</math> is possible only after an even number of steps.<br />
<math>\displaystyle Pr(X_{2n}=0|X_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim~n^{n}\sqrt{n}e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability could therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim~\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle P_{00}(m)=0</math> for odd m, the sum becomes:<br />
<br />
<math>\displaystyle \sum_{\forall m}P_{00}(m)=\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> (that is, <math>\displaystyle p\neq \frac{1}{2}</math>) then the series converges and the state is transient; otherwise <math>\displaystyle 4pq = 1</math>, the series diverges and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} = \begin{cases}<br />
\infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
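<br />
The closed form can be checked numerically. The Matlab sketch below uses the arbitrary illustrative value p=0.6 and builds the terms recursively, using <math>\binom{2n}{n} = \binom{2n-2}{n-1}\frac{2(2n-1)}{n}</math>, to avoid large factorials.<br />
<br />
 p=0.6; q=1-p; %illustrative choice with p different from 1/2<br />
 M=2000; %number of terms in the partial sum<br />
 t=1; %term for n=0, since P_00(0)=1<br />
 s=1; %running partial sum<br />
 for n=1:M<br />
     t=t*2*(2*n-1)/n*p*q; %P_00(2n) obtained from P_00(2n-2)<br />
     s=s+t;<br />
 end<br />
 s %partial sum of P_00(2n)<br />
 1/sqrt(1-4*p*q) %closed form, equal to 5 for p=0.6<br />
<br />
Setting p=0.5 instead makes the partial sums keep growing as M increases, in line with the recurrent case.<br />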
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum number of steps needed to go from the state i to the state j. It is also possible that the state j is not reachable from the state i, in which case <math>\displaystyle T_{ij}=\infty</math>.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n\geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists, given } x_{0}=i \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> of state i is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j|x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle n</math> if, starting from a state, we will return to it every <math>\displaystyle n</math> steps with probability <math>\displaystyle 1</math>.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition. Starting from state 1, the chain can return to state 1 only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math>.<br />
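<br />
The period of this chain can be checked numerically. The following is a minimal MATLAB sketch (not part of the original lecture code) that prints the probability of being back in state 1 after n steps, which is the (1,1) entry of <math>P^n</math>:<br />
 P = [0 0.5 0 0.5; 0.5 0 0.5 0; 0 0.5 0 0.5; 0.5 0 0.5 0];<br />
 for n = 1:6<br />
     Pn = P^n; % n-step transition matrix<br />
     fprintf('n = %d, P^n(1,1) = %.4f\n', n, Pn(1,1));<br />
 end<br />
The printed return probability is zero for odd n and positive for even n, which is consistent with period <math>d=2</math>.<br />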
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \; \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf).<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The detailed balance condition is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math>, as required. The second equality is detailed balance, and the last one holds because each row of P sums to 1.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but <math>P^n</math> does not converge as <math>n \rightarrow \infty</math>, because the chain is periodic with period 3.<br />
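<br />
This warning can be verified directly. The following is a minimal MATLAB sketch (an illustration, not part of the original lecture code); the variable name p is used for the stationary vector to avoid overwriting MATLAB's built-in pi:<br />
 P = [0 1 0; 0 0 1; 1 0 0];<br />
 p = [1/3 1/3 1/3];<br />
 disp(p*P); % equals p, so p is a stationary distribution<br />
 disp(P^3); % the identity matrix: the chain is back at its start after 3 steps<br />
 disp(P^4); % equals P again, so the powers of P cycle and have no limit<br />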
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if the stationary distribution is <math>\pi = [0.5 \;\; 0 \;\; 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N \rightarrow \infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_{\pi}(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
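<br />
The same answer can be checked numerically. The following is a minimal MATLAB sketch (not part of the original lecture code) that solves <math>\pi = \pi P</math> together with <math>\sum_i \pi_i = 1</math> as a linear system and compares the result with a high power of P; the variable name p stands for <math>\pi</math>:<br />
 P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];<br />
 % pi = pi*P is the same as (P' - I)*pi' = 0; append the constraint sum(pi) = 1<br />
 A = [P' - eye(3); ones(1,3)];<br />
 rhs = [zeros(3,1); 1];<br />
 p = (A \ rhs)'; % least-squares solution of the (consistent) 4-by-3 system<br />
 disp(p); % compare with [4/11 4/11 3/11]<br />
 disp(P^50); % each row should be close to p<br />
Since the chain is irreducible and ergodic, the rows of <math>P^n</math> approach the stationary distribution, so both outputs should agree with <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math> up to rounding.<br />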
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from <math>f(x)</math>, the 'target' distribution. Choose <math>q(y | x)</math>, the 'proposal' distribution that is easily sampled from.<br /><br /><br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>; this is the starting point of the chain. Choose it randomly and set the index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=\min{\{\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update the index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal distribution centered at the current state <math>\displaystyle x</math>. At the first step the current state is the randomly chosen starting point <math>\displaystyle X_0</math>.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=\min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis et al. (1953) and a special case of the Metropolis-Hastings algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the density of sampling y from a normal with mean x is equal to the density of sampling x from a normal with mean y.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in MATLAB using the following code:<br />
 b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
 N=10000; % length of the chain<br />
 X=zeros(1,N); % preallocate the chain<br />
 X(1)=randn; % random starting point<br />
 for i=2:N<br />
    Y=b*randn+X(i-1); % propose Y ~ N(X(i-1),b^2); we want to decide whether we accept this Y<br />
    r=min( (1+X(i-1)^2)/(1+Y^2),1); % acceptance probability for the Cauchy target<br />
    u=rand;<br />
    if u<r<br />
        X(i)=Y; % accept Y<br />
    else<br />
        X(i)=X(i-1); % reject Y, remaining in the current state<br />
    end;<br />
 end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> usually lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=\min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is then highly unlikely that <math>\displaystyle u<r</math>; the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> within a run of practical length the chain may not converge to the right distribution.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so nearly every proposal is accepted but each step is tiny. The chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b avoids the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
<br />
<gallery widths="250px" heights="250px"><br />
Image:Blarge.jpg|B too large (B=1000)<br />
Image:Bsmall.JPG|B too small (B=0.001)<br />
Image:Bgood.JPG|Good choice of B (B=2)<br />
</gallery><br />
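<br />
One way to quantify the effect of b is to count how often proposals are accepted. The following is a minimal MATLAB sketch (an illustration that reuses the Cauchy example above; the counter accepted is a name introduced here) that runs the chain for the three values of b shown in the plots:<br />
 bs = [0.001 2 1000]; % the three values of b shown in the plots above<br />
 N = 10000;<br />
 for k = 1:length(bs)<br />
     b = bs(k);<br />
     X = zeros(1,N);<br />
     X(1) = randn;<br />
     accepted = 0; % number of accepted proposals<br />
     for i = 2:N<br />
         Y = b*randn + X(i-1);<br />
         r = min((1+X(i-1)^2)/(1+Y^2), 1);<br />
         if rand < r<br />
             X(i) = Y;<br />
             accepted = accepted + 1;<br />
         else<br />
             X(i) = X(i-1);<br />
         end<br />
     end<br />
     fprintf('b = %g: acceptance rate = %.3f\n', b, accepted/(N-1));<br />
 end<br />
Typically the acceptance rate is close to 1 for the very small b (tiny steps, almost always accepted), close to 0 for the very large b (the chain is stuck most of the time), and somewhere in between for b=2.<br />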
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have discussed <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution, <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write detailed balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if detailed balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
Starting from the right-hand side of the stationarity condition:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> (by detailed balance)<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where it equals 1, in which both acceptance probabilities are 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is the probability of jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math>, which happens if the proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is the probability of jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math>, which happens if the proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math> when the current state is <math>\displaystyle x</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math> when the current state is <math>\displaystyle y</math><br />
:<math>\displaystyle r(x,y)</math> is the probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> is the probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})\cdot r(x,y)=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
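<br />
The argument can also be checked numerically on a small discrete state space. The following is a minimal MATLAB sketch (an illustration only; the target f and the proposal matrix q are hypothetical values chosen for the example) that builds the Metropolis-Hastings transition matrix and tests detailed balance and stationarity:<br />
 f = [0.2 0.5 0.3]; % hypothetical target distribution on 3 states<br />
 q = [0.1 0.6 0.3; 0.4 0.2 0.4; 0.5 0.25 0.25]; % hypothetical proposal, q(i,j) = q(j | i)<br />
 n = length(f);<br />
 P = zeros(n);<br />
 for i = 1:n<br />
     for j = 1:n<br />
         if i ~= j<br />
             r = min(f(j)*q(j,i) / (f(i)*q(i,j)), 1); % acceptance probability r(i,j)<br />
             P(i,j) = q(i,j)*r; % propose j and accept it<br />
         end<br />
     end<br />
     P(i,i) = 1 - sum(P(i,:)); % stay in i: either i is proposed or the proposal is rejected<br />
 end<br />
 D = diag(f)*P; % D(i,j) = f(i)*P(i,j)<br />
 disp(max(max(abs(D - D')))); % zero up to rounding, so detailed balance holds<br />
 disp(max(abs(f*P - f))); % zero up to rounding, so f is stationary<br />
The matrix D is symmetric exactly when <math>\displaystyle f_iP_{ij}=f_jP_{ji}</math> for all pairs of states, which is the discrete form of the detailed balance condition.<br />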
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999) and W. K. Hastings, who extended it in 1970; the Metropolis algorithm also underlies the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although it is less widely applicable.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough; we also want the chain to converge to the stationary distribution.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed).<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>, i.e. <math>\epsilon = Y-X_i \sim~ g </math>. Here <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. This means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is taken to be a function of the distance between the current state and the state the chain is going to travel to, i.e. it is of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. Accept <math>\displaystyle Y</math> with probability <math>r(x,y)=\min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In MATLAB, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e. <math>\displaystyle Y = X_i + randn*b </math>)<br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. (It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known, but the distribution is not exactly known.)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because the acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
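<br />
The dependence can be seen in a small experiment. The following is a minimal MATLAB sketch (an illustration only; the choice of target and proposal is an assumption made for the example): the target is <math>\displaystyle f(x) \propto e^{-x^2/2}</math> and the proposal is a standard Cauchy distribution that ignores the current state:<br />
 N = 10000;<br />
 X = zeros(1,N);<br />
 X(1) = 0;<br />
 repeats = 0; % number of steps at which the chain stays in the same state<br />
 for i = 2:N<br />
     Y = tan(pi*(rand - 0.5)); % Y ~ Cauchy by inversion, drawn independently of X(i-1)<br />
     x = X(i-1);<br />
     % r = min{ f(Y)/f(x) * q(x)/q(Y), 1 } with q the Cauchy density<br />
     r = min(exp((x^2 - Y^2)/2) * (1+Y^2)/(1+x^2), 1);<br />
     if rand < r<br />
         X(i) = Y;<br />
     else<br />
         X(i) = x;<br />
         repeats = repeats + 1;<br />
     end<br />
 end<br />
 fprintf('fraction of repeated states = %.3f\n', repeats/(N-1));<br />
 fprintf('sample mean = %.3f, sample variance = %.3f\n', mean(X), var(X));<br />
Every rejection copies the previous state, so consecutive values of the chain are dependent even though each proposal is drawn independently. For this target the sample mean and variance should be roughly 0 and 1.<br />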
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
Simulated Annealing is a method of optimization and an application of the Metropolis Hastings Algorithm.<br />
<br />
Consider the problem where we want to find <math>x</math> such that the objective function <math>h(x)</math> is at its minimum,<br /><br /><br />
<br />
<math>\ \min_{x}(h(x)) </math><br /><br /><br />
<br />
Given a constant <math>\displaystyle T > 0</math>, and since the exponential function is monotone, this optimization problem is equivalent to,<br /><br /><br />
<br />
<math>\ \max_{x}(e^{\frac{-h(x)}{T}})</math> <br /><br /><br />
<br />
We consider a distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i, & \text{with probability 1-r}\\<br />
\end{cases}</math><br /><br />i.e. <math>X_{i+1} = Y</math> if <math>U<r</math><br /><br /><br />
:#Decrease T, set <math>\displaystyle i = i+1</math>, and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, but since T is large r is close to 1, so we still accept <math>\displaystyle y </math> with a probability that is <math>\displaystyle <1 </math> but fairly high<br />
<br />
<b>On the other hand, suppose T is arbitrarily small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore almost surely reject <math>\displaystyle y </math><br />
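<br />
To make the two regimes concrete, consider a proposed move that increases the objective by one unit, <math>\displaystyle h(y)-h(x)=1</math> (the numbers below are chosen only for illustration):<br />
<br />
<math>T = 100: \quad r = e^{-\frac{1}{100}} \approx 0.990, \qquad T = 0.01: \quad r = e^{-\frac{1}{0.01}} = e^{-100} \approx 3.7 \times 10^{-44}</math><br />
<br />
So at a high temperature the chain moves almost freely, even uphill, while at a low temperature it accepts essentially only moves that decrease <math>\displaystyle h</math>.<br />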
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{-\frac{f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for a small T, i.e. T plays the role of variance, making the distribution expand and contract accordingly.<br />
<br />
<gallery widths="250px" heights="250px"><br />
Image:ezplotf1.jpg|T = 100<br />
Image:ezplotf2.jpg|T = 0.1<br />
</gallery><br />
<br />
<br />
In the end, T is small and the region we are trying to sample from becomes sharper. The points that we accept are increasingly close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
In practice, if T is too small the algorithm may get 'stuck' in another local minimum nearby, and we do not get the convergence we are looking for.<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
 % objective function, saved in its own file obj.m<br />
 function v = obj(x)<br />
 v = (x - 2).^2;<br />
<br />
 % main script<br />
 T = 10; %this is the initial value of T, which we must gradually decrease<br />
 b = 2;<br />
 X(1) = 0;<br />
 for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
     Y = b*randn + X(i-1); %propose Y ~ N(X(i-1), b^2)<br />
     r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) ); %acceptance probability<br />
     U = rand;<br />
     if U < r<br />
         X(i) = Y; %accept Y<br />
     else<br />
         X(i) = X(i-1); %reject Y<br />
     end;<br />
 end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
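<br />
The whole cooling schedule can also be run in a single script. The following is a minimal MATLAB sketch (an illustration only; running 100 iterations per temperature is an assumption based on the loop above, and the anonymous function h replaces the separate file obj.m):<br />
 h = @(x) (x - 2).^2; % objective function from this example<br />
 Ts = [10 9 5 2 0.9 0.1 0.01 0.005 0.001]; % temperatures in the order used above<br />
 b = 2;<br />
 M = 100; % iterations per temperature (assumed)<br />
 X = zeros(1, M*length(Ts) + 1);<br />
 X(1) = 0;<br />
 i = 1;<br />
 for T = Ts<br />
     for k = 1:M<br />
         i = i + 1;<br />
         Y = b*randn + X(i-1); % propose Y ~ N(X(i-1), b^2)<br />
         r = min(1, exp((h(X(i-1)) - h(Y))/T)); % acceptance probability at temperature T<br />
         if rand < r<br />
             X(i) = Y; % accept Y<br />
         else<br />
             X(i) = X(i-1); % reject Y<br />
         end<br />
     end<br />
 end<br />
 plot(X); % path of the chain over the whole schedule<br />
 disp(X(end)); % final state<br />
The final state is typically close to the minimizer <math>\displaystyle x = 2</math>, although the exact value changes from run to run.<br />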
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
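<br />
To connect this problem with the Simulated Annealing procedure above, the following is a minimal MATLAB sketch (an illustration only, not taken from the references; the random city coordinates, the segment-reversal proposal and the geometric cooling schedule are all assumptions made for the example):<br />
 n = 10;<br />
 C = rand(n,2); % hypothetical city coordinates in the unit square<br />
 % length of the closed tour t (row vector of city indices)<br />
 tourlen = @(t) sum(sqrt(sum((C(t,:) - C([t(2:end) t(1)],:)).^2, 2)));<br />
 tour = 1:n; % initial tour<br />
 L = tourlen(tour);<br />
 T = 1; % initial temperature (assumed)<br />
 for iter = 1:20000<br />
     idx = sort(randperm(n, 2)); % pick two positions in the tour<br />
     prop = tour;<br />
     prop(idx(1):idx(2)) = tour(idx(2):-1:idx(1)); % symmetric proposal: reverse the segment between them<br />
     Lp = tourlen(prop);<br />
     if rand < min(1, exp((L - Lp)/T)) % same acceptance rule as before, with h = tour length<br />
         tour = prop;<br />
         L = Lp;<br />
     end<br />
     T = 0.9995*T; % hypothetical geometric cooling schedule<br />
 end<br />
 disp(tour); % order in which the cities are visited<br />
 disp(L); % length of the final tour<br />
Here the objective function is the tour length, so shorter tours are always accepted and longer tours are accepted with a probability that shrinks as T decreases. The result is a good tour, but it is not guaranteed to be the shortest one.<br />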
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the proof given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively small correlations between the random variables.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle <br />
<br />
x_1^* \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2^* \sim~ f(x_2|x_1^*, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n^* \sim~ f(x_n|x_1^*, x_2^*, ... , x_{n-1}^*)<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1^*, x_2^*, \ldots, x_n^*) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Set <math>\displaystyle i = i+1</math> and repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution will be sampled better using only some of the last <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
 Y = [1 ; 2]; % mean vector mu<br />
 rho = 0.01;<br />
 sigma = sqrt(1 - rho^2); % conditional standard deviation<br />
 X(1,:) = [0 0];<br />
<br />
 for i = 2:5000<br />
     mu = Y(1) + rho*(X(i-1,2) - Y(2)); % conditional mean of x1 given the previous x2<br />
     X(i,1) = mu + sigma*randn;<br />
     mu = Y(2) + rho*(X(i,1) - Y(1)); % conditional mean of x2 given the NEW x1<br />
     X(i,2) = mu + sigma*randn;<br />
 end;<br />
 %plot(X(:,1),X(:,2),'.') plots all of the points<br />
 %plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -> <br />
 %this demonstrates that convergence occurs after a while <br />
 %(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some random variables <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
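<br />
The procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: both proposals are taken to be symmetric Gaussian random walks (so the proposal ratio in <math>r</math> equals 1), the target <code>log_target</code> is a hypothetical unnormalised bivariate normal chosen only for illustration, and the computation is done on the log scale for numerical stability.<br />

```python
import math
import random

def mh_within_gibbs(log_f, x0, y0, n_steps, step=1.0, seed=0):
    """Alternate one Metropolis-Hastings update of X (Y fixed) and one of Y (X fixed).

    Assumption: both proposals are symmetric Gaussian random walks, so the
    ratio q(X_n | Z) / q(Z | X_n) equals 1 and drops out of r.
    """
    rng = random.Random(seed)
    x, y = x0, y0
    chain = []
    for _ in range(n_steps):
        z = rng.gauss(x, step)                       # propose Z ~ q(Z | X_n)
        log_r = min(0.0, log_f(z, y) - log_f(x, y))  # log of r
        if rng.random() < math.exp(log_r):           # accept with probability r
            x = z
        z = rng.gauss(y, step)                       # propose Z ~ q~(Z | Y_n)
        log_r = min(0.0, log_f(x, z) - log_f(x, y))  # uses the updated X_{n+1}
        if rng.random() < math.exp(log_r):
            y = z
        chain.append((x, y))
    return chain

def log_target(x, y, rho=0.5):
    """Hypothetical target: log of an unnormalised bivariate normal with means (1, 2)."""
    dx, dy = x - 1.0, y - 2.0
    return -(dx * dx - 2.0 * rho * dx * dy + dy * dy) / (2.0 * (1.0 - rho ** 2))

chain = mh_within_gibbs(log_target, x0=0.0, y0=0.0, n_steps=20000)
kept = chain[2000:]                                  # discard an assumed burn-in
mean_x = sum(p[0] for p in kept) / len(kept)
mean_y = sum(p[1] for p in kept) / len(kept)
print(round(mean_x, 2), round(mean_y, 2))
```

Because only the ratio of target values enters <math>r</math>, the normalising constant of <math>f</math> is never needed.<br />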
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many important pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero provided <math>\displaystyle d < 1</math>, because every page receives the constant <math>\displaystyle (1-d)</math>. The coefficient <math>\displaystyle d </math>, which lies between 0 and 1, weights the contribution of the incoming links against that constant.<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we are free to fix its overall scale. Summing the definition of <math>\displaystyle P_i</math> over all pages (and assuming every page has at least one outgoing link, so that each column of <math>\displaystyle L \times D^{-1}</math> sums to 1) shows that the ranks satisfy <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1. In matrix form this is <math>\displaystyle e^T \times P = N</math>, so the constant 1 can be written as <math>\displaystyle e^T \times P / N</math> and we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector of A corresponding to eigenvalue 1. Secondly, since the entries of A are non-negative and each of its columns sums to 1, A is the transition matrix of a Markov chain, and <math>\displaystyle P / N</math> is its stationary distribution.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with an initial guess <math>\displaystyle P_0</math> (scaled to satisfy the same normalization as P) and repeatedly apply<br />
<br />
<math>\displaystyle P_n \leftarrow A \times P_{n-1} </math><br />
<br />
Since the chain converges to its stationary distribution, <math>\displaystyle P_n \approx P</math> for large n.<br />
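<br />
A small Python sketch of this kind of iteration is given below. For simplicity it iterates the matrix-form definition <math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P</math> directly instead of forming A. The three-page link structure, the value <code>d = 0.85</code> and the function name <code>page_rank</code> are assumptions made purely for illustration, and the sketch assumes every page has at least one outgoing link.<br />

```python
def page_rank(links, d=0.85, n_iter=200):
    """Iterate P <- (1 - d) * e + d * L * D^{-1} * P.

    links[i][j] is 1 if page j links to page i (the matrix L above), else 0.
    Assumption: every page has at least one outgoing link, so no c_j is zero.
    """
    n = len(links)
    c = [sum(links[i][j] for i in range(n)) for j in range(n)]  # outgoing links
    p = [1.0] * n                                               # initial guess P_0
    for _ in range(n_iter):
        p = [(1.0 - d) + d * sum(links[i][j] * p[j] / c[j] for j in range(n))
             for i in range(n)]
    return p

# Hypothetical 3-page web: page 0 -> 1, page 1 -> 0 and 2, page 2 -> 0.
L = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
ranks = page_rank(L)
print([round(r, 3) for r in ranks])
```

In this toy example page 0 receives links from both other pages and ends up with the highest rank, while page 2, linked only from a page that splits its vote, ends up lowest.<br />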
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
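<br />
The definitions above translate almost line for line into code. The following Python sketch is illustrative only; the helper names <code>dot</code>, <code>norm</code>, <code>dist</code> and <code>angle</code> and the two example vectors are chosen for this example.<br />

```python
import math

def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Length of v: sqrt(v . v)."""
    return math.sqrt(dot(v, v))

def dist(u, v):
    """Distance between u and v: the length of u - v."""
    return norm([a - b for a, b in zip(u, v)])

def angle(u, v):
    """Angle theta (radians) from u . v = ||u|| ||v|| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding

u, v = [3.0, 4.0], [4.0, -3.0]
print(dot(u, v))    # 3*4 + 4*(-3) = 0.0
print(norm(u))      # sqrt(9 + 16) = 5.0
print(dist(u, v))   # length of (-1, 7), i.e. sqrt(50)
print(angle(u, v))  # pi / 2, since the inner product is zero
```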
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that "orthogonal matrix" is another name for an orthonormal matrix; the two terms refer to the same thing.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues <math>\displaystyle \lambda_1, \dots, \lambda_n</math> of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0,\ \forall x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0,\ \forall x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math> and satisfies <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>, where <math>\lambda</math> is the corresponding eigenvalue.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular, all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is symmetric and positive-definite, all eigenvalues are positive.<br />
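<br />
For a 2 by 2 symmetric matrix the characteristic equation is a quadratic and can be solved by hand, which makes the properties above easy to check numerically. The Python sketch below is illustrative only; the function name <code>eig_sym_2x2</code> and the example matrix are assumptions made for this example.<br />

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, c]].

    The characteristic equation |A - lambda I| = 0 is
    lambda^2 - (a + c) lambda + (a c - b^2) = 0.
    """
    if b == 0:                       # A is already diagonal
        return sorted([(a, (1.0, 0.0)), (c, (0.0, 1.0))], reverse=True)
    tr = a + c
    root = math.sqrt((a - c) ** 2 + 4.0 * b * b)   # square root of the discriminant
    pairs = []
    for lam in ((tr + root) / 2.0, (tr - root) / 2.0):
        v = (b, lam - a)             # solves (A - lam I) v = 0 when b != 0
        n = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / n, v[1] / n)))
    return pairs

pairs = eig_sym_2x2(2.0, 1.0, 2.0)   # A = [[2, 1], [1, 2]]
for lam, v in pairs:
    print(lam, v)
print(sum(lam for lam, _ in pairs))  # equals the trace a + c = 4
```

For this matrix the eigenvalues are 3 and 1: both are real and positive, their sum equals the trace, and the two eigenvectors are orthogonal.<br />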
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
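<br />
The length-preserving property can be checked numerically with a rotation matrix, which is one example of an orthonormal transformation. The sketch below is illustrative; the helper names and the angle <math>\pi/6</math> are arbitrary choices made for this example.<br />

```python
import math

def rotation(theta):
    """2x2 rotation matrix, an example of an orthonormal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def mat_vec(A, x):
    """Matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def length(x):
    """Magnitude |x| = sqrt(x^T x)."""
    return math.sqrt(sum(a * a for a in x))

A = rotation(math.pi / 6)
x = [3.0, 4.0]
y = mat_vec(A, x)
print(length(x), length(y))   # the two lengths agree up to rounding
```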
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which we will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
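Only the direction of <math>\textbf{w}</math> matters: for any non-zero scalar <math>c</math>, <math>\textbf{w}</math> and <math>c\textbf{w}</math> span the same line, so without loss of generality we take <math>\textbf{w}</math> to have unit length,<br><br />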
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method on this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one has the larger value: <math>\displaystyle f(\sqrt{2}/2,-\sqrt{2}/2) = \sqrt{2}</math> and <math>\displaystyle f(-\sqrt{2}/2,\sqrt{2}/2) = -\sqrt{2}</math>. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
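<br />
The result can be sanity-checked numerically without any calculus: every point on the constraint <math>\displaystyle x^2+y^2=1</math> can be written as <math>\displaystyle (\cos t, \sin t)</math>, so a brute-force search over <math>\displaystyle t</math> should land on the same maximiser. The Python sketch below is only such a check, with an arbitrarily chosen grid size; it is not part of the Lagrange multiplier method itself.<br />

```python
import math

def f(x, y):
    """Objective function from the example."""
    return x - y

# Every point on the constraint x^2 + y^2 = 1 is (cos t, sin t) for some t.
n = 100000
best_t = max((2.0 * math.pi * k / n for k in range(n)),
             key=lambda t: f(math.cos(t), math.sin(t)))
best = (math.cos(best_t), math.sin(best_t))
print(best, f(*best))   # close to (sqrt(2)/2, -sqrt(2)/2) and sqrt(2)
```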
<br />
====Determining '''W''' ====<br />
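Back to the original problem, writing <math>\displaystyle S</math> for the sample covariance matrix (denoted <math>\displaystyle s</math> above), the Lagrangian is<br />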
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
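If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} = 1 </math> and the second term of the Lagrangian is 0, so on the constraint set <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so any stationary point of the Lagrangian is a unit vector.<br />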
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
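<br />
To make the derivation concrete, the following Python sketch computes a sample covariance matrix and finds its leading eigenvector by power iteration. Power iteration is a technique substituted here for illustration (the notes do not prescribe a particular eigen-solver), and the synthetic data set, the seed and the function names are assumptions. The sketch also assumes the starting vector is not orthogonal to the leading eigenvector.<br />

```python
import random

def covariance(X):
    """Sample covariance matrix S of the rows of X (one observation per row)."""
    n, D = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(D)]
    Xc = [[row[j] - mean[j] for j in range(D)] for row in X]
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(D)]
            for i in range(D)]

def first_pc(S, n_iter=1000):
    """Power iteration: unit vector w maximising w^T S w, and lambda = w^T S w."""
    D = len(S)
    w = [1.0] * D
    for _ in range(n_iter):
        w = [sum(S[i][j] * w[j] for j in range(D)) for i in range(D)]
        norm = sum(a * a for a in w) ** 0.5
        w = [a / norm for a in w]
    lam = sum(w[i] * S[i][j] * w[j] for i in range(D) for j in range(D))
    return w, lam

# Hypothetical data: 500 points stretched along the direction (1, 1).
rng = random.Random(0)
X = []
for _ in range(500):
    t, e = rng.gauss(0, 3), rng.gauss(0, 0.5)
    X.append([t + e, t - e])

S = covariance(X)
w, lam = first_pc(S)
trace = S[0][0] + S[1][1]
print(w)                  # roughly proportional to (1, 1)
print(lam, trace - lam)   # variance along the first PC and the remaining variance
```

In two dimensions the remaining eigenvalue is simply <math>Tr(S) - \lambda_1</math>, which illustrates how the principal components split up the total variance.<br />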
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat') % loads the data matrix X<br />
who                                % list the variables that were loaded<br />
size(X)                            % one image per column<br />
imagesc(reshape(X(:,1),20,28)')    % show the first noisy image<br />
colormap gray<br />
[u, s, v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. use labelled data to learn how to assign a label to each new data point) or cluster the data (i.e. group unlabelled data points by similarity).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 by 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to work with a reduced set of features extracted from the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
[[Image:23plotPCA.jpg]]<br /><br /><br /><br /><br />
<br />
The matlab code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs , scores ] = princomp (X') <br />
% performs principal component analysis on the data matrix X<br />
% (newer MATLAB releases provide pca in place of princomp)<br />
% returns the principal component coefficients and scores<br />
% scores is the low dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
[[File:Plot1.jpg]]<br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
[[File:Plot2.jpg]]<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
% this adds the data for the 3s<br />
% the 'ro' option draws them as red circles on the plot<br />
[[File:Plot3.jpg]]<br />
% If we classify based on the position in this plot (feature), <br />
% it's easier than looking at each of the 64 data pieces<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
[[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (for details, see [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y = U^T X_{D \times n} </math>, where the columns of <math>\, U_{D \times d} </math> are the top d principal directions. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This eliminates the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
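<br />
The centering, encoding and reconstruction steps in these notes can be sketched in Python (chosen here only for illustration; the course code on this page is MATLAB). This is a minimal sketch under stated assumptions: it keeps a single direction (d = 1), finds it by power iteration rather than a full eigendecomposition, and the function names are hypothetical.<br />

```python
import math

def center(points):
    """Subtract the mean so that the data have mean 0."""
    n, D = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(D)]
    return [[p[j] - mean[j] for j in range(D)] for p in points], mean

def top_direction(centered, iters=100):
    """Leading eigenvector u of the (unnormalized) covariance, by power iteration."""
    D = len(centered[0])
    cov = [[sum(p[a] * p[b] for p in centered) for b in range(D)] for a in range(D)]
    u = [1.0] * D  # arbitrary start; assumed not orthogonal to the answer
    for _ in range(iters):
        v = [sum(cov[a][b] * u[b] for b in range(D)) for a in range(D)]
        norm = math.sqrt(sum(c * c for c in v))
        u = [c / norm for c in v]
    return u

def encode(centered, u):
    """y_i = u^T x_i: project each centered point onto the direction u."""
    return [sum(ui * xi for ui, xi in zip(u, x)) for x in centered]

def reconstruct(codes, u, mean):
    """x_hat_i = u * y_i + mean: map the codes back to the original space."""
    return [[ui * y + m for ui, m in zip(u, mean)] for y in codes]

# Points lying exactly on a line are recovered exactly from one component.
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
centered, mean = center(data)
u = top_direction(centered)
x_hat = reconstruct(encode(centered, u), u, mean)
```

For data that do not lie on a line, the reconstruction is only approximate, and the discarded part is the low-variance component.<br />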
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to, depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance. That is, ideally the individual classes are as far away from each other as possible, while the data within each class are close together (in the extreme, collapsing to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2 </math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar. Therefore,<br />
<br /><br /><br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two class problem the between class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a well-known problem called "the generalized eigenvalue problem". We can solve it using Lagrange multipliers. Since w is a direction vector, we do not care about its length: rescaling w leaves the ratio unchanged. Therefore we fix the denominator and solve a problem similar to that in PCA,<br />
<br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br />
subject to <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br /><br />
We form the following Lagrangian,<br />
<br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, our Optimization problem for FDA is:<br />
<br />
<math><br />
\max_{w} \frac{w^T s_B w}{w^T s_w w}<br />
</math><br />
<br />
which we turned into<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />Subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
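where <math>s_B</math> is the between-class covariance matrix and <math>s_w</math> is the within-class covariance matrix.<br />
<br />
Using Lagrange multipliers, we form the Lagrangian: <br />
<math>\displaystyle (w^Ts_Bw) - \lambda \cdot [(w^Ts_ww)-1] </math><br />
<br />
Setting its derivative with respect to w to zero gives <math>\displaystyle s_Bw = \lambda s_ww</math>, i.e. <math>\displaystyle s_w^{-1}s_Bw = \lambda w</math> (assuming <math>\displaystyle s_w</math> is invertible), and at such a point the objective <math>\displaystyle w^Ts_Bw</math> equals <math>\displaystyle \lambda</math>. Therefore:<br />
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />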
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified to:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
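<br />
The two-class simplification holds because <math>\displaystyle s_Bw = (\mu_1 - \mu_2)[(\mu_1 - \mu_2)^Tw]</math>, where the bracket is a scalar, so <math>\displaystyle s_Bw</math> always points along <math>\displaystyle \mu_1 - \mu_2</math>; the sign and length of w do not matter. A minimal Python sketch for the two-dimensional case follows. It is an illustration rather than the course code: the helper names are hypothetical, and the numbers are the means and covariance used in the MATLAB example on this page.<br />

```python
def solve_2x2(A, b):
    """Solve A x = b for a 2x2 matrix A using the explicit inverse."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("s_w must be invertible")
    return [(a22 * b[0] - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det]

def fisher_direction(mu1, mu2, sigma1, sigma2):
    """Two-class Fisher direction w = s_w^{-1} (mu2 - mu1), defined up to scale."""
    s_w = [[sigma1[i][j] + sigma2[i][j] for j in range(2)] for i in range(2)]
    diff = [mu2[0] - mu1[0], mu2[1] - mu1[1]]
    return solve_2x2(s_w, diff)

sigma = [[1.0, 1.5], [1.5, 3.0]]
w = fisher_direction([1.0, 1.0], [5.0, 3.0], sigma, sigma)
```

Here s_w = [2 3; 3 6] and the mean difference is (4, 2), so w works out to (6, -8/3).<br />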
<br />
=====Example:=====<br />
<br />
We now see an application of the theory just introduced. Using MATLAB, we find the principal components and the Fisher Discriminant Analysis projection of samples from two bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data sets:<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300) <br />
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300) <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
X=[X1;X2] % stack the two samples: 600 x 2<br />
X=X' % 2 x 600, one column per data point<br />
[coefs, scores]=princomp(X');<br />
coefs(:,1) %first principal component<br />
<br />
ans =<br />
0.76355476446932<br />
0.64574307712603<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=sw\[4; 2] % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') % the projected data are one-dimensional, so we plot them along the horizontal line y=1<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
%We see in the above picture that the two classes do not overlap<br />
Xp=coefs(:,1)'*X<br />
figure<br />
plot(Xp(1:300),1,'ob')<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
%In this case the two classes overlap, since we project onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we might wish to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (When Y is a continuous variable, the prediction problem is regression rather than classification.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we find the empirical error rate <math>\, \hat E(h) </math>?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
<br />
Error rate is also called misclassification rate, and 1 minus the error rate is sometimes called the classification rate.<br />
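<br />
The counting rule for the empirical error rate can be sketched in Python as follows. The threshold classifier and the four labelled points are made up for illustration only.<br />

```python
def empirical_error(h, xs, ys):
    """Fraction of points (x_i, y_i) with h(x_i) != y_i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need equally many x and y values, and at least one")
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

# Hypothetical classifier: predict class 1 when x is at least 0.5.
def threshold_classifier(x):
    return 1 if x >= 0.5 else 0

xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]
error = empirical_error(threshold_classifier, xs, ys)  # one mistake out of four
```

Only the second point is misclassified, so the empirical error rate is 1/4 and the classification rate is 3/4.<br />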
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math>. By Bayes' theorem,<br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)Pr(Y=1)} {Pr(X=x)} </math><br />
<br />
where the denominator can be written as <math>\, Pr(X=x) = Pr(X=x \mid Y=1)Pr(Y=1)+Pr(X=x \mid Y=0)Pr(Y=0) </math><br />
<br />
So our classifier function is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & o/w\\<br />
\end{cases}</math><br />
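<br />
When the class-conditional densities and the prior are known, this rule can be evaluated directly. The Python sketch below assumes a toy model that is not from the lecture (class 0 is N(0, 1), class 1 is N(2, 1), equal priors), and the function names are hypothetical.<br />

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_class1(x, f0, f1, prior1):
    """r(x) = Pr(Y=1 | X=x) from the class densities f0, f1 and Pr(Y=1)."""
    num = f1(x) * prior1
    den = num + f0(x) * (1.0 - prior1)
    return num / den

def bayes_classifier(x, f0, f1, prior1):
    """h(x) = 1 if r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior_class1(x, f0, f1, prior1) >= 0.5 else 0

# Assumed toy model: class 0 ~ N(0, 1), class 1 ~ N(2, 1), equal priors.
f0 = lambda x: normal_pdf(x, 0.0, 1.0)
f1 = lambda x: normal_pdf(x, 2.0, 1.0)
labels = [bayes_classifier(x, f0, f1, 0.5) for x in (-1.0, 0.5, 1.5, 3.0)]
```

In this toy model the two posteriors are equal at x = 1, the midpoint of the two means, so points to the left are labelled 0 and points to the right are labelled 1.<br />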
<br />
This classifier is optimal in terms of error rate: no other classifier has a lower error rate. A problem is that we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>. A real-life interpretation of the posterior <math>\, Pr(Y=1 \mid X=x)</math> may even involve patterns and meaning, which makes it harder still to express mathematically.<br />
[[Image:Set Differentiation.jpg|right]]<br />
In the image, which set would it be more appropriate for the question mark to belong to?<br />
<br />
<br />
The Bayes classifier is therefore viewed as a theoretical bound: the best that any classification technique can achieve. Practical techniques approximate it, for example by estimating its Decision Boundary.<br />
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, the boundary consists of those points where the posterior probabilities of the two classes are identical. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a decision boundary represented by a linear function, while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
<br />
We would like to apply the Bayes classification rule by approximating the class conditional density <br />
<math>\, f_k(x) </math> and the prior <math>\, \pi_k </math> in<br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l}f_l(x)\pi_l}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional density is multivariate Gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>\, (x + a)^TA(x+b) = x^TAx + a^TAb + x^TAb + a^TAx </math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\, \Sigma_k </math> is the class covariance matrix and <math>\, \mu_k </math> is the class mean. By definition of the decision boundary (decision boundary between class <math>\ k </math> and class <math>\ l </math>),<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is a linear function of <math>\ x </math> of the form <math>\, x^Ta + b = 0 </math>.<br />
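<br />
Expanding the quadratic forms under the common-covariance assumption gives <math>\, a = \Sigma^{-1}(\mu_k - \mu_l) </math> and <math>\, b = \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} </math>. The Python sketch below computes these for two-dimensional data. It is an illustration only; the helper names and the test values (identity covariance, means (0,0) and (2,0), equal priors) are assumptions.<br />

```python
import math

def inverse_2x2(S):
    """Explicit inverse of a 2x2 matrix."""
    (s11, s12), (s21, s22) = S
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("covariance matrix must be invertible")
    return [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]

def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_boundary(mu_k, mu_l, sigma, pi_k, pi_l):
    """Return (a, b) such that the boundary is the set {x : x^T a + b = 0}."""
    s_inv = inverse_2x2(sigma)
    a = mat_vec(s_inv, [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]])
    diff = [mu_l[0] - mu_k[0], mu_l[1] - mu_k[1]]
    total = [mu_l[0] + mu_k[0], mu_l[1] + mu_k[1]]
    b = 0.5 * dot(diff, mat_vec(s_inv, total)) + math.log(pi_k / pi_l)
    return a, b

a, b = lda_boundary([0.0, 0.0], [2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
```

With these values a = (-2, 0) and b = 2, so the boundary is the vertical line through the midpoint of the two means.<br />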
<br />
=====Example:=====<br />
<br />
%Load data set<br />
load 2_3;<br />
[coefs, scores] = princomp(X');<br />
size(X)<br />
% ans = 64 400 % 64 pixels (features), 400 images<br />
size(coefs)<br />
% ans = 64 64 <br />
size(scores)<br />
% ans = 400 64<br />
Y=scores(:, 1:2);<br />
% just use two of the 64 principal components<br />
plot(Y(1:200, 1),Y(1:200, 2), 'b.')<br />
hold on<br />
plot(Y(201:400, 1),Y(201:400, 2), 'r.')<br />
<br />
[[File:Pca_2.jpg]]<br />
<br />
ll=[zeros(200,1);ones(200,1)]; % class labels: 0 for the 2s, 1 for the 3s<br />
[C,err,P,logp,coeff] = classify(Y, Y, ll, 'linear');<br />
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|) </math><br />
<br />
<br />
<br />
To classify a point, <math>\ x </math>, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class <math>\ k </math> if <math>\, \delta_k > \delta_l </math> and vice versa. <br /><br />
<br />
:<math><br />
h(x) = \begin{cases}<br />
k, & \text{if } \delta_k > \delta_l \\<br />
l, & otherwise\\<br />
\end{cases}</math><br />
<br />
(note: since <math> - \frac{d}{2}log(2\pi) </math> is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.) <br /><br /><br />
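<br />
The comparison of the deltas can be sketched in Python for two-dimensional data, with the constant term dropped as in the note. This is a hedged illustration: the function names, the class labels "k" and "l", and the example means are assumptions.<br />

```python
import math

def delta(x, mu, sigma, prior):
    """log(pi) - 0.5*(x-mu)^T Sigma^{-1} (x-mu) - 0.5*log|Sigma| for d = 2.

    The constant -(d/2) log(2 pi) is omitted because it is the same for
    every class and cancels in the comparison.
    """
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # (x-mu)^T Sigma^{-1} (x-mu), written out with the explicit 2x2 inverse
    quad = (s22 * d0 * d0 - (s12 + s21) * d0 * d1 + s11 * d1 * d1) / det
    return math.log(prior) - 0.5 * quad - 0.5 * math.log(det)

def classify_point(x, classes):
    """classes maps a label to (mu, sigma, prior); return the label with the largest delta."""
    return max(classes, key=lambda label: delta(x, *classes[label]))

identity = [[1.0, 0.0], [0.0, 1.0]]
classes = {"k": ([0.0, 0.0], identity, 0.5), "l": ([4.0, 0.0], identity, 0.5)}
label_near_k = classify_point([1.0, 0.0], classes)
label_near_l = classify_point([3.5, 1.0], classes)
```

With identity covariances and equal priors, as here, the rule reduces to choosing the class whose mean is nearest to the point.<br />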
<br />
'''Special Case: <math>\, \Sigma_k = I </math>''', the identity matrix. Then,<br />
<br />
<math> \,\delta_k = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) - \frac{1}{2}log(|I|) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) </math><br />
<br />
We see that in the case <math>\, \Sigma = I </math>, we can simply classify a point, <math>\ x </math>, to a class based on the distances between <math>\ x </math> and the mean of the different classes (adjusted with the log of the prior). <br /><br /><br />
<br />
'''General Case: <math>\, \Sigma_k \ne I </math>''' <br /><br /><br />
<br />
<math> \, \Sigma_k = USV^T = USU^T </math> (since <math>\ \Sigma </math> is symmetric)<br /><br /><br />
<math> \, \Sigma_k^{-1} = (USU^T)^{-1} = (U^T)^{-1}S^{-1}U^{-1} = US^{-1}U^T </math><br /><br /><br />
<br />
So, <math> (x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^T US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-1}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k)^T I(S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k) </math><br />
:<math> \, = (x^* - \mu_k^*)^T I(x^* - \mu_k^*) </math> <br /><br /><br />
<br />
where <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> and <math> \mu_k^* = S^{-\frac{1}{2}}U^T\mu_k </math><br />
<br />
Hence the approach taken should be to transform the point <math>x</math> (and the class means) from the beginning,<br />
<br />
i.e. Let <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> <br /><br /><br />
<br />
Then compute <math> \, \delta_k </math> and <math> \, \delta_l </math> with <math>x^*</math>, similar to the special case above. <br /><br /><br />
<br />
If the prior probabilities of the 2 classes are the same, then this method only requires us to find the distances from the transformed point <math>x^*</math> to the transformed means of the 2 classes. We classify x to the class with the nearest mean.<br />
<br />
In the <math>\delta</math> function calculations above, <math>\pi_k = P(Y = k)</math> and <math>\pi_l = P(Y = l)</math> are the class priors, which can be approximated using the proportions of class <math>k</math> and class <math>l</math> elements in the training set.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3443stat341 / CM 3612009-07-24T03:10:43Z<p>Hclam: /* Computational Method */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges, because a computer is deterministic. In computational statistics we instead generate numbers by a deterministic procedure whose output appears to follow a given probability distribution, for example appearing uniform (such numbers are called pseudo-random). Outside a computational setting, sampling from a uniform distribution is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''' when it involves both an additive and a multiplicative term (''b'' ≠ 0); with ''b'' = 0 it is purely multiplicative. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1 (0 itself cannot appear when ''b'' = 0 and the seed is nonzero). Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
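<br />
Both worked examples can be reproduced with a short Python sketch of the recurrence. The function name and the choice to include the seed as the first element are assumptions made for illustration; n is assumed to be at least 1.<br />

```python
def lcg(a, b, m, seed, n):
    """Return [x_0, ..., x_{n-1}] where x_{i+1} = (a*x_i + b) mod m."""
    xs = [seed]
    for _ in range(n - 1):
        xs.append((a * xs[-1] + b) % m)
    return xs

good = lcg(13, 0, 31, 1, 31)  # first example: a=13, b=0, m=31, x_0=1
bad = lcg(3, 2, 4, 1, 5)      # second example: a=3, b=2, m=4, x_0=1

distinct = len(set(good[:30]))           # number of values before the cycle restarts
uniforms = [x / (31 - 1) for x in good]  # scaled by m - 1 as described above
```

The first generator visits all 30 values from 1 to 30 before returning to 1, while the second is stuck at 1 from the start.<br />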
<br />
Ideally, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, a 1988 paper by Park and Miller found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is passed through the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, strictly increasing, and is the cumulative distribution function (cdf) of some distribution, then the random variable X has cdf F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is at most a larger number is at least the probability that X is at most a smaller number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
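<br />
The two steps can be sketched generically in Python, with the inverse cdf passed in as a function. As an assumed target, the sketch uses the exponential distribution with rate <math>\theta</math>, whose inverse cdf is <math>-\log(1-u)/\theta</math>; the function names and the seed are hypothetical choices.<br />

```python
import math
import random

def inverse_transform_sample(inv_cdf, n, rng):
    """Draw n values X = F^{-1}(U), with U drawn uniformly from [0, 1)."""
    return [inv_cdf(rng.random()) for _ in range(n)]

def exponential_inv_cdf(theta):
    """Inverse cdf of the exponential distribution with rate theta."""
    return lambda u: -math.log(1.0 - u) / theta

rng = random.Random(0)  # fixed seed so that the run is repeatable
sample = inverse_transform_sample(exponential_inv_cdf(2.0), 10000, rng)
sample_mean = sum(sample) / len(sample)  # should be near 1/theta = 0.5
```

Passing the inverse cdf as an argument keeps the sampling step identical for any distribution whose inverse cdf is available in closed form.<br />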
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as Unif[0, 1], we may equivalently use <math>\frac{-\log(u)}{\theta}</math>.<br />
<br />
Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in MATLAB using the code below: generate <math>u_i</math>, calculate <math>x_i</math> using the above formula with <math>\theta=1</math>, and plot the histogram of the <math>x_i</math> for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
<br />
<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is difficult or impossible to find a closed-form expression for the inverse of <math>F(x)</math>. Further, for some distributions (the normal distribution, for example) <math>F(x)</math> itself has no closed-form expression.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 < x_1 < x_2 \dots < x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the less-than signs used in class have been changed here to less-than-or-equal-to signs on the upper bounds, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
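The discrete procedure amounts to walking along the cumulative sums of the probabilities until the drawn uniform value is covered. A minimal Python sketch, with a hypothetical function name and a three-point distribution chosen for illustration, takes the uniform draw u as an argument so that the mapping can be checked by hand:<br />

```python
from itertools import accumulate

def discrete_inverse_transform(values, probs, u):
    """Return x_k for the first k with u <= p_0 + ... + p_k."""
    for value, cumulative in zip(values, accumulate(probs)):
        if u <= cumulative:
            return value
    return values[-1]  # guards against rounding when the p_i sum to just under 1

values, probs = [0, 1, 2], [0.3, 0.2, 0.5]
picks = [discrete_inverse_transform(values, probs, u) for u in (0.1, 0.45, 0.75, 1.0)]
```

In practice u would come from a uniform random number generator; the fixed values here give 0, 1, 2 and 2 respectively.<br />
<br />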
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in MATLAB using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
x=zeros(1,1000); % preallocate the sample<br />
for i=1:1000<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
elseif u <= sum(p(1:2)) % cumulative probability 0.5<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant c such that,<br /><br /><br />
<math> c \cdot g(x) \geq f(x)\ \forall x</math><br />
<br /><br /><br />
we can draw samples in succession from <math>g(x)</math> and accept each one with probability<br />
<br /><br />
<math> \frac {f(x)}{c \cdot g(x)} </math>,<br />
<br /><br />
rejecting it otherwise. The accepted samples follow the target distribution <math>f(x)</math>. A sample is likely to be accepted where this ratio is close to 1 and likely to be rejected where it is close to 0.<br />
<br />
<br />
<br />
For example, in the figure, at x=7 the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is close to 1, so a sample drawn there is usually accepted.<br />
<br />
At x=9 the ratio is smaller, so a sample drawn there is rejected more often.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math>\begin{align} Pr(accept) &= \int^{}_x Pr(accept|X) \times Pr(X) dx \\<br />
&= \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx \\<br />
&= \frac{1}{c} \int^{}_x f(x) dx \\<br />
&= \frac{1}{c} \end{align}</math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
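The procedure can be sketched as a generic Python function that takes the target density, the proposal density, a proposal sampler and the constant c. All names are hypothetical. As an assumed test case the sketch uses the target f(x) = 2x on [0, 1] with a Unif[0, 1] proposal and c = 2.<br />

```python
import random

def accept_reject(f, g_pdf, g_sample, c, n, rng):
    """Draw n samples from density f using a proposal g with c*g(x) >= f(x)."""
    samples, proposals = [], 0
    while len(samples) < n:
        y = g_sample(rng)               # step 1: Y ~ g
        u = rng.random()                # step 2: U ~ Unif[0, 1)
        proposals += 1
        if u <= f(y) / (c * g_pdf(y)):  # step 3: accept with probability f(Y)/(c g(Y))
            samples.append(y)
    return samples, proposals

# Assumed target for illustration: f(x) = 2x on [0, 1], proposal Unif[0, 1], c = 2.
rng = random.Random(1)
samples, proposals = accept_reject(
    f=lambda x: 2.0 * x,
    g_pdf=lambda x: 1.0,
    g_sample=lambda r: r.random(),
    c=2.0, n=5000, rng=rng)
acceptance_rate = len(samples) / proposals  # should be near 1/c = 0.5
sample_mean = sum(samples) / len(samples)   # the density 2x on [0, 1] has mean 2/3
```

The observed acceptance rate gives a quick check on the choice of c, since the proof above shows that the probability of acceptance is 1/c.<br />
<br />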
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> f(x) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
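<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) = \max (2x) = 2 </math> on <math>[0,1]</math>. Taking <math>c = \max \left(\frac{f(x)}{g(x)}\right)</math> gives the smallest constant that satisfies <math> c \cdot g(x) \geq f(x) </math> everywhere, and since <math>Pr(accept) = \frac{1}{c}</math>, the smallest valid <math>c</math> also gives the highest acceptance rate (here, half of the proposals are accepted on average).<br />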
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
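 n = 1000;            % number of accepted samples wanted<br />
 x = zeros(n,1);      % preallocate the output<br />
 j = 0;               % number of samples accepted so far<br />
 while j < n<br />
   y = rand;          % proposal Y ~ Unif[0,1]<br />
   u = rand;          % U ~ Unif[0,1]<br />
   if u <= y          % accept with probability f(y)/(c*g(y)) = 2y/2 = y<br />
     j = j + 1;<br />
     x(j) = y;<br />
   end<br />
 end<br />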
hist(x);<br />
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram below has the shape of the Beta(2,1) density, as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
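<br />
As a quick check of the proof above, the acceptance rate should be close to <math>Pr(accept) = \frac{1}{c} = 0.5</math>. The following is a minimal vectorized sketch; the number of proposals is an arbitrary illustrative choice:<br />
<br />
 n = 100000;              % number of proposals (illustrative choice)<br />
 y = rand(n,1);           % proposals Y ~ Unif[0,1]<br />
 u = rand(n,1);           % U ~ Unif[0,1]<br />
 keep = (u <= y);         % accept when U <= f(Y)/(c*g(Y)) = Y<br />
 x = y(keep);             % accepted samples, distributed as Beta(2,1)<br />
 accRate = mean(keep)     % should be close to 1/c = 0.5<br />
 sampleMean = mean(x)     % should be close to E[X] = 2/3 for Beta(2,1)<br />
<br />
Since the mean of Beta(2,1) is 2/3, the last line gives a second sanity check on the accepted samples.<br />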
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
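<br />
The steps of this discrete example can be sketched in Matlab as follows. This is only an illustration: the sample size n and the variable names are assumptions, not part of the lecture.<br />
<br />
 f = [0.15 0.55 0.20 0.10];   % target pmf on {1,2,3,4}<br />
 g = 0.25;                    % discrete uniform proposal pmf<br />
 c = max(f / g);              % c = 2.2<br />
 n = 10000;                   % number of samples wanted (illustrative)<br />
 x = zeros(n,1);<br />
 j = 0;                       % number of samples accepted so far<br />
 while j < n<br />
   y = randi(4);              % Y ~ Unif{1,2,3,4}<br />
   u = rand;                  % U ~ Unif[0,1]<br />
   if u <= f(y)/(c*g)         % accept with probability f(Y)/(c*g(Y))<br />
     j = j + 1;<br />
     x(j) = y;<br />
   end<br />
 end<br />
 freq = accumarray(x, 1, [4 1])' / n   % empirical frequencies, to compare with f<br />
<br />
The empirical frequencies in freq should be close to the target pmf (0.15, 0.55, 0.20, 0.10).<br />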
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, <math>c</math> must be large and we may have to reject a large number of points before obtaining a sample of the desired size from the target distribution. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, the Inverse Method is difficult to apply because the inverse of <math>F</math> has no closed form, and the Acceptance/Rejection Method is also difficult to apply.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math> 1/\lambda </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
This suggests that to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add the samples up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
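<br />
The same idea extends to a general rate <math>\lambda</math>. The following is a minimal sketch, where the values t = 3 and lambda = 2 are illustrative choices that do not come from the lecture:<br />
<br />
 t = 3;  lambda = 2;                   % illustrative parameter values<br />
 U = rand(10000, t);                   % t independent uniforms per sample<br />
 X = -(1/lambda) * log(prod(U, 2));    % X = -(1/lambda)*log(u_1*...*u_t)<br />
 sampleMean = mean(X)                  % should be close to t/lambda = 1.5<br />
 sampleVar = var(X)                    % should be close to t/lambda^2 = 0.75<br />
<br />
Comparing the sample mean and variance with <math>t/\lambda</math> and <math>t/\lambda^2</math> is a simple way to check that the output behaves like a <math>Gamma(t, \lambda)</math> sample.<br />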
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far; e.g., the inverse of F(Z) is not easy to obtain. We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point on the Cartesian plane, r is the distance from that point to the pole (the origin), and θ is the angle formed with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, an Exponential density in d and a Uniform density in <math>\theta</math>, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \quad \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
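<br />
One way to fill in the change of variables is sketched here, assuming the standard Jacobian formula applies: the Jacobian of <math>(x,y) \to (r,\theta)</math> is <math>r</math>, and <math>d = r^2</math> gives <math>\frac{dr}{dd} = \frac{1}{2\sqrt{d}}</math>.<br />
<br />
:<math>\begin{align} f(r,\theta) &{}= f(x,y)\, r = \frac{1}{2\pi}\, r\, e^{-\frac{r^2}{2}} \\ f(d,\theta) &{}= f(r,\theta)\,\frac{1}{2\sqrt{d}} = \frac{1}{2\pi}\sqrt{d}\, e^{-\frac{d}{2}}\,\frac{1}{2\sqrt{d}} = \frac{1}{2}e^{-\frac{d}{2}} \cdot \frac{1}{2\pi} \end{align}</math><br />
<br />
The first factor is the density of an exponential random variable with rate 1/2 and the second is the density of a uniform random variable on <math>[0, 2\pi]</math>.<br />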
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>u=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
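<br />
The reason both choices work is that, for a row vector z with identity covariance, z*A has covariance A'*A, so any matrix A with A'*A = <math>\Sigma</math> can be used. The following is a minimal sketch of this check (the variable names are illustrative):<br />
<br />
 sigma = [1 1/2; 1/2 1];<br />
 R = chol(sigma);                 % upper triangular, R'*R = sigma<br />
 S = sqrtm(sigma);                % symmetric, S*S = sigma<br />
 errChol = norm(R'*R - sigma)     % should be numerically zero<br />
 errSqrtm = norm(S*S - sigma)     % should be numerically zero<br />
 Z = randn(15000,2);              % rows are independent N(0,I) vectors<br />
 covChol = cov(Z*R)               % sample covariance, close to sigma<br />
 covSqrtm = cov(Z*S)              % sample covariance, also close to sigma<br />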
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right] </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i = \bar{x}</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta = \mu</math> is Gaussian with mean <math>\mu_0</math> and standard deviation <math>\tau</math> (with <math>\sigma</math> treated as known):<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math> and <math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean differs from the MLE: it is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
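<br />
The behaviour of the posterior mean as N grows can be illustrated with a short simulation. This is a sketch under assumed values: the prior parameters, the data standard deviation and the "true" mean used to simulate data are all hypothetical choices made for the illustration.<br />
<br />
 mu0 = 0;  tau = 1;  sigma = 2;     % prior mean/sd and known data sd (illustrative)<br />
 muTrue = 3;                        % hypothetical true mean used to simulate data<br />
 for N = [0 1 10 100 1000]<br />
   x = muTrue + sigma*randn(N,1);   % simulated sample of size N<br />
   if N > 0, xbar = mean(x); else xbar = 0; end   % xbar gets zero weight when N = 0<br />
   wData = (N/sigma^2) / (N/sigma^2 + 1/tau^2);   % weight on the sample mean<br />
   muTilde = wData*xbar + (1 - wData)*mu0;        % posterior mean<br />
   fprintf('N = %4d   xbar = %7.4f   posterior mean = %7.4f\n', N, xbar, muTilde);<br />
 end<br />
<br />
For N = 0 the posterior mean equals the prior mean, and as N increases it moves toward the sample mean.<br />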
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
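<br />
The process above, including the standard error and confidence interval, can be sketched in Matlab as follows. The integrand <math>h(x) = x^2</math> on <math>[1,3]</math> is an illustrative choice (not from the lecture) whose exact integral is 26/3.<br />
<br />
 h = @(x) x.^2;             % illustrative integrand (not from the lecture)<br />
 a = 1;  b = 3;             % integration limits; the exact value is 26/3<br />
 N = 10000;<br />
 x = a + (b-a)*rand(N,1);   % x_i ~ Unif(a,b)<br />
 w = h(x)*(b-a);            % w(x_i) = h(x_i)(b-a)<br />
 Ihat = mean(w)             % Monte Carlo estimate of I<br />
 SE = std(w)/sqrt(N)        % standard error, s/sqrt(N)<br />
 CI = [Ihat - 1.96*SE, Ihat + 1.96*SE]   % approximate 95% CI (z = 1.96)<br />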
<br />
'''Example 1'''<br />
<br />
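Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x \geq 0</math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, which can be done with the Inverse Transform Method: <math>x_i = -\log(u_i)</math>, <math>u_i \sim~ Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math> with <math>w(x) = \sqrt{x}</math><br />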
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
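 u=rand(100,1);<br />
 x=-log(u);                 % x ~ Exp(1) by the Inverse Transform Method<br />
 h= x.^0.5;                 % h(x) = sqrt(x)<br />
 mean(h)<br />
 %The value obtained using the Monte Carlo method<br />
 F = @(x) sqrt(x).*exp(-x);<br />
 quad(F,0,50)<br />
 %The value of the integral computed numerically by Matlab (the exact value is sqrt(pi)/2, about 0.8862)<br />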
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# The exact value is <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
 x = rand(1000,1);       % 1000 draws from Unif(0,1)<br />
 mean(x.^3)              % elementwise cube; should be close to 1/4<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite interval,<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math>; <math>x=5</math> maps to <math>y=1</math> and <math>x \to \infty</math> maps to <math>y \to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
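<br />
This example can be sketched in Matlab as follows, using the substitution <math>y = \frac{1}{x-4}</math>, which turns the integral into <math>\int_0^1 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy</math> with a positive integrand. The sample size is an illustrative choice, and the exact value <math>3e^{-5}</math> is included for comparison.<br />
<br />
 N = 100000;<br />
 y = rand(N,1);                         % Y_i ~ Unif(0,1)<br />
 w = 3*exp(-(1./y + 4)) ./ y.^2;        % integrand after the substitution<br />
 Ihat = mean(w)                         % Monte Carlo estimate<br />
 Iexact = 3*exp(-5)                     % exact value, about 0.0202<br />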
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of <math>\displaystyle p_1</math> is proportional to the Binomial likelihood <math>\displaystyle p_1^x(1-p_1)^{n-x}</math> viewed as a function of <math>\displaystyle p_1</math>, which is the kernel of a Beta distribution (and similarly for <math>\displaystyle p_2</math>). Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{i=1}^{N}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
 mean(delta)   %no semicolon, so that the value is displayed<br />
<br />
In one run of this code, the mean was 0.1938.<br />
<br />
A 95% interval for <math>\delta</math> (a Bayesian credible interval) is given by the 2.5% and 97.5% quantiles of the sampled values, which bracket 95% of the posterior probability. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The 95% interval is approximately <math> (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We proceed as follows:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> independently for <math>\displaystyle i = 1, ..., 10</math><br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta</math> by finding the first dose <math>\displaystyle x_j</math> whose sampled <math>\displaystyle p_j</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{i=1}^{N} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and it is observed that the number that die is <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
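<br />
The Monte Carlo process above can be sketched in Matlab as follows. This is only an illustration under loud assumptions: the dose levels x and the death counts y below are hypothetical values invented for the sketch (they are not the lecture's data), and a fixed number of proposals B is drawn in one batch instead of repeating until N samples are accepted. As in the earlier example, betarnd requires the Statistics Toolbox.<br />
<br />
 n = 10;                                  % rats per dose<br />
 x = 1:10;                                % hypothetical dose levels x_1 < ... < x_10<br />
 y = [0 1 1 2 4 5 7 8 9 10];              % hypothetical death counts (not the lecture's data)<br />
 B = 200000;                              % number of proposed draws (illustrative)<br />
 P = betarnd(repmat(y+1, B, 1), repmat(n-y+1, B, 1));   % each row is one draw of (p_1,...,p_10)<br />
 ordered = all(diff(P, 1, 2) >= 0, 2);    % rows satisfying p_1 <= ... <= p_10<br />
 P = P(ordered, :);                       % discard the other rows<br />
 P = P(any(P >= 0.5, 2), :);              % keep rows where some p_i reaches 0.5<br />
 [~, j] = max(double(P >= 0.5), [], 2);   % index of the first p_i >= 0.5 in each row<br />
 delta = x(j);                            % corresponding dose for each kept draw<br />
 numAccepted = size(P, 1)                 % number of draws that survived the ordering check<br />
 deltaBar = mean(delta)                   % Monte Carlo estimate of delta<br />
<br />
Because the ordering constraint rejects most draws, numAccepted is typically much smaller than B; a larger B gives a more stable estimate.<br />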
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps to estimate the properties of a particular distribution. As is the case with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others, and the samples are weighted accordingly rather than rejected. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the proposal distribution to sample from and g(x) is the function to be integrated. Here we cast the integral <math>\int_{0}^1 g(x)dx</math> as an expectation with respect to U, such that <math>\int_{0}^1 g(x)dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
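# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from and <math>\displaystyle g(x) > 0</math> wherever <math>\displaystyle h(x)f(x) \neq 0</math>.<br />
# Draw <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math> and compute <math>\displaystyle w(x_i)=\frac{h(x_i)f(x_i)}{g(x_i)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />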
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
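<br />
As an illustration of the process, the following sketch estimates the tail probability <math>P(Z > 3)</math> for <math>Z \sim~ N(0,1)</math>. The problem and the proposal are assumptions chosen for this example and are not from the lecture: here <math>h(x)</math> is the indicator of <math>x > 3</math>, <math>f</math> is the standard normal density, and <math>g(x) = e^{-(x-3)}</math> for <math>x > 3</math> is a shifted exponential density sampled with the Inverse Transform Method.<br />
<br />
 N = 100000;<br />
 u = rand(N,1);<br />
 x = 3 - log(u);                          % x_i ~ g, g(x) = exp(-(x-3)) for x > 3<br />
 f = exp(-x.^2/2) / sqrt(2*pi);           % standard normal density f(x_i)<br />
 g = exp(-(x-3));                         % proposal density g(x_i)<br />
 w = f ./ g;                              % w = h*f/g, with h(x_i) = 1 since every x_i > 3<br />
 Ihat = mean(w)                           % importance sampling estimate of P(Z > 3)<br />
 SE = std(w)/sqrt(N)                      % standard error of the estimate<br />
 Iexact = 0.5*erfc(3/sqrt(2))             % exact tail probability, about 0.00135<br />
<br />
Sampling directly from <math>f</math> would place very few points beyond 3, whereas every draw from <math>g</math> lands in the region that matters.<br />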
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> at which the original density <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), is large relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we draw samples from <math>\displaystyle g(x)</math>, evaluate <math>\displaystyle h</math> at each one, and weight each value by <math>\displaystyle \frac{f(x)}{g(x)}</math>, so that the samples for which <math>\displaystyle \frac{f(x)}{g(x)}</math> is high count more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of <math> h(x_i) </math> instead of a plain average.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Where <math>\displaystyle f > g</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen <math>\displaystyle g</math>, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x) </math> does not, then <math>\displaystyle E[(w(x))^2] </math> can be infinite. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>: the integrand <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can then become infinitely large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
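<br />
As a small worked illustration (the densities here are chosen for convenience and are not from the lecture), take <math>\displaystyle h(x) = 1</math>, <math>\displaystyle f(x) = e^{-x}</math> and <math>\displaystyle g(x) = \lambda e^{-\lambda x}</math> on <math>\displaystyle x \geq 0</math>. Then<br />
::<math>\displaystyle E_g[(w(x))^2] = \int_0^\infty \frac{e^{-2x}}{\lambda e^{-\lambda x}}\,dx = \frac{1}{\lambda}\int_0^\infty e^{-(2-\lambda)x}\,dx = \frac{1}{\lambda(2-\lambda)} \quad \text{for } 0 < \lambda < 2</math><br />
and the integral diverges for <math>\displaystyle \lambda \geq 2</math>, where <math>\displaystyle g</math> decays too quickly relative to <math>\displaystyle f</math>. The second moment is smallest at <math>\displaystyle \lambda = 1</math> (that is, <math>\displaystyle g = f</math>), where it equals 1.<br />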
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can find the <math>\displaystyle g</math> that minimizes the variance (see the theorem below).<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x))^2] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
::::::::::'''Aside: Jensen's Inequality'''<br />
::<br />
If <math>\displaystyle \phi </math> is a convex function (for example, twice differentiable with <math>\displaystyle \phi''(x) \geq 0 </math>) then <math>\displaystyle \phi(\alpha x_1 + (1-\alpha)x_2) \leq \alpha \phi(x_1) + (1-\alpha) \phi(x_2)</math> for <math>\displaystyle 0 \leq \alpha \leq 1</math><br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to more than two points:<br />
::<math>\displaystyle \phi(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + ... + \alpha_n \phi(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle \phi(E[x]) \leq E[\phi(x)] </math> as <math>\displaystyle \phi(E[x]) = \phi(p_1 x_1 + ... p_n x_n) \leq p_1 \phi(x_1) + ... + p_n \phi(x_n) = E[\phi(x)] </math><br />
Therefore, taking <math>\displaystyle \phi(t) = t^2</math> and applying it to <math>\displaystyle \left| w(x) \right|</math>,<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E[(w(x))^2]</math>, the bound is attained, so <math>\displaystyle g^*</math> minimizes the variance.<br /><br />
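<br />
To spell out the substitution step, here is a short verification using only the definitions above. The symbol <math>\displaystyle c = \int \left| h(s) \right| f(s) ds</math> is introduced here purely for brevity, so that <math>\displaystyle g^*(x) = \left| h(x) \right| f(x) / c</math>:<br />
::<math>\displaystyle E_{g^*}[(w(x))^2] = \int \frac{f(x)^2 h(x)^2}{g^*(x)} dx = c \int \frac{f(x)^2 h(x)^2}{\left| h(x) \right| f(x)} dx = c \int \left| h(x) \right| f(x) dx = c^2 </math><br />
which is exactly the lower bound <math>\displaystyle \left(\int \left| h(x) \right| f(x) dx \right)^2 </math>.<br />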
<br />
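However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, because its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />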
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and the <math>x_i</math> are drawn from <math>g</math><br />
Since <math>f</math> is a density, <math>\int f(x)\,dx = 1</math>, so we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math> (the factors <math>1/N</math> cancel)<br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This form is especially useful when f is known only up to a proportionality constant, because any constant factor in f cancels between the numerator and denominator of <math>b'(x_i)</math>. This is often the case in the Bayesian approach, where f is a posterior density function.<br />
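<br />
A minimal Matlab sketch of the normalized-weight estimator follows. The unnormalized standard normal target, the proposal <math>N(0, 2^2)</math>, the choice <math>h(x) = x^2</math>, and all variable names are illustrative assumptions, not part of the lecture:<br />
<br />
 N = 100000;<br />
 x = 2*randn(N,1);                      % draws from the assumed proposal g = N(0, 2^2)<br />
 fu = exp(-x.^2/2);                     % target known only up to a constant (N(0,1) without 1/sqrt(2*pi))<br />
 g = exp(-x.^2/8) / (2*sqrt(2*pi));     % proposal density evaluated at the draws<br />
 b = fu ./ g;                           % unnormalized weights b(x_i)<br />
 bprime = b / sum(b);                   % normalized weights b'(x_i), which sum to 1<br />
 h = x.^2;                              % assumed integrand h(x) = x^2<br />
 Ihat = sum(bprime .* h)                % estimate of E[X^2] under the target<br />
<br />
Since <math>E[X^2] = 1</math> for a standard normal, Ihat should come out close to 1 even though the constant <math>1/\sqrt{2\pi}</math> was never used in the target.<br />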
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle Pr (Z>3) </math> when <math>Z \sim N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_i \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and count the fraction of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo, since this is a low probability event (a tail event) and different runs may give significantly different values. For example, with 1000 samples per run, the first run may give only 3 occurrences (estimated probability 0.003), the second run may give 5 occurrences (estimated probability 0.005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we compute 100 estimates, each from 100 samples):<br />
<br />
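 format long<br />
 a = zeros(1,100);                      % preallocate the 100 estimates<br />
 for i = 1:100<br />
     a(i) = sum(randn(100,1)>=3)/100;   % fraction of 100 N(0,1) draws that are at least 3<br />
 end<br />
 meanMC = mean(a)                       % average of the 100 estimates<br />
 varMC = var(a)                         % run-to-run variance of the estimates<br />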
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
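:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is the standard normal density and <math>g(x)</math> needs to be chosen wisely so that it is similar in shape to <math>h(x)f(x)</math>.<br />
<br />
:Let <math>g(x)</math> be the density of <math>N(4,1) </math>, which puts most of its mass in the region <math>x \geq 3</math> where <math>h(x)f(x)</math> is nonzero.<br />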
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
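<br />
The closed form of the weight follows directly from the two normal densities; the step-by-step simplification is written out here as a check:<br />
:<math>b(x) = \frac{\frac{1}{\sqrt{2\pi}}e^{-x^2/2}}{\frac{1}{\sqrt{2\pi}}e^{-(x-4)^2/2}} = \exp\left(\frac{(x-4)^2 - x^2}{2}\right) = \exp\left(\frac{16-8x}{2}\right) = e^{8-4x}</math><br />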
<br />
This example can be illustrated in Matlab using the code below:<br />
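 N = 100;                               % samples per estimate<br />
 I = zeros(1,100);                      % preallocate the 100 estimates<br />
 w = zeros(N,1);<br />
 for j = 1:100<br />
     x = randn(N,1) + 4;                % draw from the proposal g = N(4,1)<br />
     for ii = 1:N<br />
         h = x(ii)>=3;                  % indicator h(x)<br />
         b = exp(8-4*x(ii));            % weight b(x) = f(x)/g(x)<br />
         w(ii) = h*b;<br />
     end<br />
     I(j) = sum(w)/N;                   % importance sampling estimate<br />
 end<br />
 MEAN = mean(I)<br />
 VAR = var(I)<br />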
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math>, which is very close to 0 and much smaller than the variance observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int^\ h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int^\ h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t=i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\ddots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
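<br />
As a small illustration, the random walk matrix can be built in Matlab as sketched below. The number of states (5) and the value p = 0.3 are arbitrary assumptions chosen for the example:<br />
<br />
 n = 5; p = 0.3; q = 1 - p;       % assumed number of states and step probability<br />
 P = zeros(n);<br />
 P(1,1) = 1; P(n,n) = 1;          % first and last states: we stop once they are reached<br />
 for i = 2:n-1<br />
     P(i,i+1) = p;                % step to the right with probability p<br />
     P(i,i-1) = q;                % step to the left with probability q<br />
 end<br />
 rowSums = sum(P, 2)'             % every row of a transition matrix sums to 1<br />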
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a Monte Carlo average can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1: row <math>i</math> is the pmf of the next state given that the chain is currently in state <math>i</math>, and the next state must still lie within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
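<br />
The theorem can be checked numerically. The following Matlab sketch uses a hypothetical 2-state transition matrix and starting distribution (both assumptions made for illustration) and compares the step-by-step update with the matrix power:<br />
<br />
 P = [0.9 0.1; 0.4 0.6];          % hypothetical transition matrix<br />
 mu0 = [1 0];                     % hypothetical initial distribution: start in state 1<br />
 mu = mu0;<br />
 for n = 1:5<br />
     mu = mu * P;                 % mu_n = mu_{n-1} P<br />
 end<br />
 mu_direct = mu0 * P^5;           % mu_5 = mu_0 P^5<br />
 disp([mu; mu_direct])            % the two rows agree up to rounding<br />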
<br />
====Summary of definitions====<br />
*'''transition matrix''': an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''': also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''': <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written uniquely as a disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only move to each other or stay where they are (the only nonzero entries in rows 1 and 2 are <math>\displaystyle P_{11}</math>, <math>\displaystyle P_{12}</math>, <math>\displaystyle P_{21}</math> and <math>\displaystyle P_{22}</math>)<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class with only one element<br />
::<br />
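<br />
The same decomposition can be found mechanically. The Matlab sketch below is an illustration rather than part of the lecture: it marks state j as reachable from state i when the (i,j) entry of <math>(I+P)^{n-1}</math> is positive, and then groups states that reach each other. The names R, C and classes are chosen here for the example:<br />
<br />
 P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];<br />
 n = size(P,1);<br />
 R = (eye(n) + P)^(n-1) > 0;      % R(i,j) is true if j can be reached from i in at most n-1 steps<br />
 C = R & R';                      % C(i,j) is true if i and j communicate<br />
 classes = unique(double(C), 'rows')   % each distinct row is the indicator vector of one class<br />
<br />
Each row of classes has ones in the positions of the states belonging to one class, which reproduces the classes {1, 2}, {3} and {4} (listed in the sorted order that unique returns).<br />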
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous property and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. We can only be back at <math>\displaystyle 0</math> after an even number of steps, <math>\displaystyle 2n</math>, of which exactly n are to the right and n are to the left:<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know whether this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use Stirling's formula:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
Applying it to <math>\displaystyle {2n \choose n} = \frac{(2n)!}{(n!)^2}</math> gives <math>\displaystyle {2n \choose n} \sim \frac{4^{n}}{\sqrt{n\pi}}</math>, so the probability can be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since returns to 0 are only possible after an even number of steps, the sum becomes:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{n \geq 1}P_{00}(2n)\approx\sum_{n \geq 1}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We have <math>\displaystyle 4pq \leq 1</math>, with equality only when <math>\displaystyle p = q = \frac{1}{2}</math>. If <math>\displaystyle 4pq < 1</math> the series converges geometrically and the state is transient. If <math>\displaystyle 4pq = 1</math> the series reduces to <math>\displaystyle \sum 1/\sqrt{n\pi}</math>, which diverges, and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n) = \begin{cases}<br />
\infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
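<br />
The closed form can be compared with a partial sum. This Matlab sketch assumes p = 0.3 and 200 terms (both arbitrary choices), and builds each term from the previous one using <math>\binom{2n+2}{n+1} = \binom{2n}{n}\frac{2(2n+1)}{n+1}</math> to avoid large factorials:<br />
<br />
 p = 0.3; q = 1 - p;              % assumed step probabilities<br />
 term = 1; total = 1;             % n = 0 term: P_00(0) = 1<br />
 for n = 0:199<br />
     term = term * 2*(2*n+1)/(n+1) * p*q;   % next term C(2n+2,n+1)(pq)^(n+1)<br />
     total = total + term;<br />
 end<br />
 total                            % partial sum of P_00(2n)<br />
 closed = 1/sqrt(1 - 4*p*q)       % closed form from the binomial series<br />
<br />
With p = 0.3 the sum is finite, so state 0 is transient. Repeating the experiment with p = 0.5 makes the partial sums grow without bound as more terms are added.<br />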
<br />
====Convergence of Markov chain====<br />
We define the ''first passage time'' <math>\displaystyle T_{ij}</math> as the minimum number of steps needed to go from state i to state j; when j = i it is called the ''recurrence time''. It is also possible that state j is not reachable from state i.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
min\{n \geq 1: x_{n}=j \mid x_{0}=i\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} n f_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
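The period of a state <math>\displaystyle i</math> is <math>\displaystyle d = \gcd\{n \geq 1 : P_{ii}(n) > 0\}</math>, so a return to <math>\displaystyle i</math> is only possible after a multiple of <math>\displaystyle d</math> steps. A state with <math>\displaystyle d > 1</math> is called <math>\emph{periodic}</math>. In an irreducible chain all states have the same period, so we can speak of the period of the chain.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />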
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: starting from state 1, a return to state 1 is only possible after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
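<br />
The period of state 1 can be checked numerically as the greatest common divisor of the step counts at which a return has positive probability. The Matlab sketch below only inspects the first 20 steps, a cutoff assumed here for illustration:<br />
<br />
 P = [0 0.5 0 0.5; 0.5 0 0.5 0; 0 0.5 0 0.5; 0.5 0 0.5 0];<br />
 d = 0;                           % gcd(0, n) = n, so 0 is a neutral starting value<br />
 Pn = eye(4);<br />
 for n = 1:20<br />
     Pn = Pn * P;                 % Pn now holds the n-step transition matrix P^n<br />
     if Pn(1,1) > 1e-12           % a return to state 1 is possible in n steps<br />
         d = gcd(d, n);<br />
     end<br />
 end<br />
 d                                % period of state 1<br />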
<br />
<br />
===Markov Chains and their Stationary Distributions - June 16, 2009===<br />
====New Definition: Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \; \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math> for all <math>i, j</math>.<br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>. Using detailed balance in the second step:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required, since each row of <math>P</math> sums to 1.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
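<br />
A quick Matlab check of this warning, using the matrix above (the variable name piStat is chosen here for illustration):<br />
<br />
 P = [0 1 0; 0 0 1; 1 0 0];<br />
 piStat = [1/3 1/3 1/3];<br />
 disp(piStat * P)                 % equals piStat, so it is a stationary distribution<br />
 disp(P^3)                        % the identity matrix: the powers of P cycle with period 3<br />
 disp(P^4)                        % equals P again, so P^n has no limit as n grows<br />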
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>. For example, if <math>\pi = [0.5 \; 0 \; 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N \to \infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
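<br />
The result can be verified in Matlab. The sketch below, with variable names chosen for illustration, solves the stationarity equations together with the normalization constraint and compares the answer with a high power of P:<br />
<br />
 P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];<br />
 A = [P' - eye(3); ones(1,3)];    % rows 1-3: pi*P = pi (transposed); row 4: entries sum to 1<br />
 rhs = [0; 0; 0; 1];<br />
 pi_hat = (A \ rhs)'              % solution of the consistent overdetermined system<br />
 P100 = P^100                     % every row approaches the limiting distribution<br />
<br />
Both pi_hat and each row of P100 should agree with (4/11, 4/11, 3/11), approximately (0.3636, 0.3636, 0.2727).<br />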
<br />
===Monte Carlo using Markov Chain - June 18, 2009===<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int^\ h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>: this is the starting point of the chain. Choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br />2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=\min\left\{\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})},1\right\}</math><br />
:<br />4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. The starting point <math>X_0</math> is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of the Metropolis-Hastings algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the density of proposing y from a normal with mean x is equal to the density of proposing x from a normal with mean y.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
 b = 2;                 % let b=2 for now, we will see what happens when b is smaller or larger<br />
 N = 10000;             % length of the chain<br />
 X = zeros(1,N);        % preallocate the chain<br />
 X(1) = randn;          % random starting point<br />
 for i = 2:N<br />
     Y = b*randn + X(i-1);                 % propose Y ~ N(X(i-1), b^2)<br />
     r = min((1+X(i-1)^2)/(1+Y^2), 1);     % acceptance probability for the Cauchy target<br />
     u = rand;<br />
     if u < r<br />
         X(i) = Y;          % accept Y<br />
     else<br />
         X(i) = X(i-1);     % reject Y, remaining in the current state<br />
     end<br />
 end<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> is usually far from the current state, out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is typically very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>; the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: the shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b avoids both of the issues mentioned above, and we say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
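<br />
The effect of b can also be checked numerically by counting how often proposals are accepted. The following is a minimal Python sketch of the same sampler (the lecture code is in MATLAB); the function names, the seed and the chain length are our own choices for illustration, not part of the lecture.<br />

```python
import math
import random


def cauchy_density(x):
    """Standard Cauchy density f(x) = 1 / (pi * (1 + x^2))."""
    return 1.0 / (math.pi * (1.0 + x * x))


def random_walk_mh(b, n_steps, seed=0):
    """Metropolis-Hastings with proposal Y ~ N(x, b^2) and Cauchy target.

    Returns the chain and the fraction of proposals that were accepted.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)              # arbitrary starting state
    chain = [x]
    accepted = 0
    for _ in range(n_steps - 1):
        y = x + b * rng.gauss(0.0, 1.0)  # propose Y given the current state
        r = min(cauchy_density(y) / cauchy_density(x), 1.0)
        if rng.random() < r:             # accept Y with probability r
            x = y
            accepted += 1
        chain.append(x)                  # on rejection the old state repeats
    return chain, accepted / (n_steps - 1)


if __name__ == "__main__":
    for b in (0.001, 2.0, 1000.0):
        _, rate = random_walk_mh(b, 10000)
        print(f"b = {b:g}: acceptance rate = {rate:.3f}")
```

With b very small almost everything is accepted (but the chain barely moves), and with b very large almost everything is rejected; a moderate b sits in between.<br />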
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br /> We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write detailed balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if detailed balance holds (i.e. <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>), then <math>\displaystyle f(x)</math> is a stationary distribution, i.e. <math>\int f(y)P(y,x)dy=f(x)</math>.<br />
<br />
Starting from the integral in the stationarity condition and applying detailed balance inside it:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math><br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (if the product equals 1, then <math>\displaystyle r(x,y)=r(y,x)=1</math> and the argument below goes through unchanged)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is the probability of jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math>: the proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is the probability of jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math>: the proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math> when the chain is at <math>\displaystyle x</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math> when the chain is at <math>\displaystyle y</math><br />
:<math>\displaystyle r(x,y)</math> is the probability of accepting <math>\displaystyle y</math> when the chain is at <math>\displaystyle x</math><br />
:<math>\displaystyle r(y,x)</math> is the probability of accepting <math>\displaystyle x</math> when the chain is at <math>\displaystyle y</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})\cdot r(x,y)=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br /><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
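<br />
The same two facts (detailed balance, and hence stationarity) can be checked numerically on a small discrete state space. The sketch below is in Python rather than MATLAB; the target <code>f</code>, the asymmetric proposal matrix <code>q</code> and all function names are made up for illustration.<br />

```python
def mh_transition_matrix(f, q):
    """Build the Metropolis-Hastings transition matrix on states 0..n-1.

    f[i]    : target probability of state i (must be positive)
    q[i][j] : probability that the proposal moves from i to j
    """
    n = len(f)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and q[i][j] > 0:
                # acceptance probability r(i, j)
                r = min(f[j] * q[j][i] / (f[i] * q[i][j]), 1.0)
                P[i][j] = q[i][j] * r
        P[i][i] = 1.0 - sum(P[i])  # rejected proposals keep the chain at i
    return P


def detailed_balance_gap(f, P):
    """Largest violation of f_i P_ij = f_j P_ji."""
    n = len(f)
    return max(abs(f[i] * P[i][j] - f[j] * P[j][i])
               for i in range(n) for j in range(n))


def stationarity_gap(f, P):
    """Largest violation of f = f P."""
    n = len(f)
    return max(abs(sum(f[i] * P[i][j] for i in range(n)) - f[j])
               for j in range(n))


if __name__ == "__main__":
    f = [0.1, 0.2, 0.3, 0.4]
    q = [[0.10, 0.50, 0.20, 0.20],
         [0.30, 0.10, 0.40, 0.20],
         [0.25, 0.25, 0.25, 0.25],
         [0.60, 0.10, 0.20, 0.10]]
    P = mh_transition_matrix(f, q)
    print("detailed balance gap:", detailed_balance_gap(f, P))
    print("stationarity gap:    ", stationarity_gap(f, P))
```

Both gaps should be zero up to floating-point rounding, even though <code>q</code> itself is not symmetric.<br />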
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), whose algorithm also underlies the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling can be more efficient, although it is less widely applicable.<br />
<br />
In the last class, we showed that Metropolis-Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br /><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough; we also want the chain to converge to the stationary distribution.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This holds for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed).<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis-Hastings.<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br />2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is taken to be a function of the distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br />3. Accept <math>\displaystyle Y</math> with probability <math>r=\min{\{\frac{f(y)}{f(x)},1}\}</math>; otherwise stay at <math>\displaystyle X_i</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In MATLAB, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This works well only when <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. (The method is usually used when <math>\displaystyle f </math> is known up to a proportionality constant: the form of the distribution is known, but its normalizing constant is not.)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: No. Even though the proposal <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
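<br />
The dependence can be seen in a small experiment. The Python sketch below is only an illustration under assumptions of our own: the target is a standard normal known up to a constant, the fixed proposal is <math>\displaystyle N(0,2^2)</math>, and the function names are hypothetical. It reports the lag-1 autocorrelation of the chain, which would be close to 0 for an independent sequence.<br />

```python
import math
import random


def target_unnormalized(x):
    """Target f known only up to a constant: f(x) proportional to exp(-x^2/2)."""
    return math.exp(-0.5 * x * x)


def proposal_unnormalized(x, s):
    """Fixed proposal q = N(0, s^2); its constant cancels in q(x)/q(y)."""
    return math.exp(-0.5 * (x / s) ** 2)


def independence_mh(n_steps, s=2.0, seed=0):
    """Independence Metropolis-Hastings: the proposal ignores the current state."""
    rng = random.Random(seed)
    x = 0.0
    chain = [x]
    for _ in range(n_steps - 1):
        y = rng.gauss(0.0, s)  # drawn from the fixed distribution q
        ratio = (target_unnormalized(y) / target_unnormalized(x)) * \
                (proposal_unnormalized(x, s) / proposal_unnormalized(y, s))
        if rng.random() < min(ratio, 1.0):
            x = y
        chain.append(x)        # a rejection repeats the current state
    return chain


def lag1_autocorrelation(xs):
    """Sample correlation between consecutive elements of xs."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((v - mean) ** 2 for v in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1)) / (n - 1)
    return cov / var


if __name__ == "__main__":
    chain = independence_mh(20000)
    print("lag-1 autocorrelation:", lag1_autocorrelation(chain))
```

Every rejection repeats the previous value, so consecutive states are positively correlated even though the proposals themselves are independent draws.<br />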
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But, this is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for some constant <math>\displaystyle T>0</math> (since the exponential function is monotone increasing, <math>\displaystyle e^{\frac{-h(x)}{T}}</math> is largest exactly where <math>\displaystyle h(x)</math> is smallest)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
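:#Decrease T, set <math>\displaystyle i = i + 1</math> and go to Step 3 (the chain continues from the current state rather than restarting from a random point)<br />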
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
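<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math>; since T is large the exponent is close to 0, so <math>\displaystyle r </math> is close to 1 and uphill moves are still accepted often<br />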
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore (almost always) reject <math>\displaystyle y </math><br />
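<br />
To get a feel for these two regimes, the acceptance probability can be evaluated for one fixed uphill move at several temperatures. This is a small Python sketch; the function name and the example values <math>\displaystyle h(x)=1</math>, <math>\displaystyle h(y)=2</math> are ours, chosen only for illustration.<br />

```python
import math


def acceptance_probability(h_x, h_y, T):
    """Probability of accepting a move from x to y at temperature T > 0,
    assuming a symmetric proposal: r = min(exp((h(x) - h(y)) / T), 1)."""
    if h_y <= h_x:
        return 1.0                    # downhill (or level) moves: always accept
    return math.exp((h_x - h_y) / T)  # uphill moves: accept with probability < 1


if __name__ == "__main__":
    for T in (100.0, 1.0, 0.01):
        print(f"T = {T:g}: r = {acceptance_probability(1.0, 2.0, T):.3g}")
```

For large T the uphill move is accepted almost always, while for small T it is essentially never accepted; downhill moves are accepted at every temperature.<br />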
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it makes a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle \frac{f(x)}{T}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, when T is pretty small, the distribution that we're sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
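% obj.m: the objective function (saved in its own file)<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
% main script<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); %propose Y ~ N(X(i-1), b^2)<br />
r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) ); %same as exp(-obj(Y)/T)/exp(-obj(X(i-1))/T), but avoids 0/0 when T is small<br />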
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
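<br />
The whole cooling run can also be written as a single loop over the temperatures. The following is a Python sketch of the same experiment, not the lecture's MATLAB session; the number of steps per temperature, the seed and the function names are assumptions made for illustration.<br />

```python
import math
import random


def h(x):
    """Objective function to minimize."""
    return (x - 2.0) ** 2


def simulated_annealing(temperatures, steps_per_temperature=100,
                        b=2.0, x0=0.0, seed=0):
    """Run Metropolis-Hastings with proposal N(x, b^2) at each temperature in turn.

    Returns the full path of visited states.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for T in temperatures:  # T is decreased from one stage to the next
        for _ in range(steps_per_temperature):
            y = x + b * rng.gauss(0.0, 1.0)
            delta = h(y) - h(x)
            # r = min(exp(-delta / T), 1), written so that exp never overflows
            r = 1.0 if delta <= 0 else math.exp(-delta / T)
            if rng.random() < r:
                x = y
            path.append(x)
    return path


if __name__ == "__main__":
    schedule = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
    path = simulated_annealing(schedule)
    print("final state:", path[-1])
```

The early, hot stages wander widely; as T shrinks the path settles near <math>\displaystyle x = 2</math>.<br />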
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
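<br />
Simulated annealing applies to this problem by taking <math>\displaystyle h</math> to be the length of a tour. The Python sketch below is only an illustration and is not taken from the lecture or from the paper above: the proposal (reversing a randomly chosen segment of the tour, which is a symmetric move), the cooling schedule and all function names are assumptions.<br />

```python
import math
import random


def tour_length(tour, cities):
    """Length of the closed circuit that visits the cities in the given order."""
    n = len(tour)
    return sum(math.dist(cities[tour[k]], cities[tour[(k + 1) % n]])
               for k in range(n))


def anneal_tour(cities, temperatures, steps_per_temperature=500, seed=0):
    """Simulated annealing over tours; returns the best tour found and its length."""
    rng = random.Random(seed)
    tour = list(range(len(cities)))
    rng.shuffle(tour)
    current = tour_length(tour, cities)
    best_tour, best = tour[:], current
    for T in temperatures:
        for _ in range(steps_per_temperature):
            # symmetric proposal: reverse the segment between two random positions
            i, j = sorted(rng.sample(range(len(tour)), 2))
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            length = tour_length(candidate, cities)
            delta = length - current
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                tour, current = candidate, length
                if current < best:
                    best_tour, best = tour[:], current
    return best_tour, best


if __name__ == "__main__":
    # six cities on a regular hexagon of radius 1
    cities = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3))
              for k in range(6)]
    tour, length = anneal_tour(cities, [1.0, 0.5, 0.1, 0.01])
    print(tour, length)
```

For cities on a regular hexagon the shortest circuit simply walks around the hexagon, which makes the result easy to check by eye.<br />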
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, as Metropolis-Hastings does:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The procedure for proving this balance equation is similar to what was done in the Metropolis-Hastings proof.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively small correlations between the random variables.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution is sampled better by keeping only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; % mean vector (mu_1, mu_2)<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2); % conditional standard deviation<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); % conditional mean of x_1 given the previous x_2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); % conditional mean of x_2 given the NEW x_1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -> <br />
% this demonstrates that convergence occurs after a while <br />
% (this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
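<br />
The same sampler can be sketched in Python to check that the draws reproduce the intended means and correlation. This is an illustration only: the function names, the seed, the burn-in length and the value <math>\displaystyle \rho = 0.5</math> are our own choices, not the lecture's.<br />

```python
import math
import random


def gibbs_bivariate_normal(mu1, mu2, rho, n_samples, seed=0):
    """Gibbs sampler for a bivariate normal with unit variances and correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    x1, x2 = 0.0, 0.0                # starting point
    samples = []
    for _ in range(n_samples):
        # x1 | x2 ~ N(mu1 + rho (x2 - mu2), 1 - rho^2)
        x1 = rng.gauss(mu1 + rho * (x2 - mu2), sd)
        # x2 | x1 ~ N(mu2 + rho (x1 - mu1), 1 - rho^2), using the NEW x1
        x2 = rng.gauss(mu2 + rho * (x1 - mu1), sd)
        samples.append((x1, x2))
    return samples


def summarize(samples):
    """Return the two sample means and the sample correlation."""
    n = len(samples)
    m1 = sum(s[0] for s in samples) / n
    m2 = sum(s[1] for s in samples) / n
    v1 = sum((s[0] - m1) ** 2 for s in samples) / n
    v2 = sum((s[1] - m2) ** 2 for s in samples) / n
    cov = sum((s[0] - m1) * (s[1] - m2) for s in samples) / n
    return m1, m2, cov / math.sqrt(v1 * v2)


if __name__ == "__main__":
    draws = gibbs_bivariate_normal(1.0, 2.0, 0.5, 20000)[1000:]  # drop burn-in
    print(summarize(draws))
```

After discarding the burn-in, the sample means should be close to <math>\displaystyle (1, 2)</math> and the sample correlation close to <math>\displaystyle \rho</math>.<br />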
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
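<br />
The procedure above can be sketched in Python as follows. This is a sketch under assumptions of our own: the target is a standard bivariate normal with correlation 0.5 (known up to a constant), both <math>\displaystyle q </math> and <math>\displaystyle \tilde{q} </math> are symmetric normal random-walk proposals (so the proposal ratios in the acceptance probabilities equal 1), and the function names are hypothetical.<br />

```python
import math
import random


def log_f(x, y, rho=0.5):
    """Log of the target density, up to an additive constant."""
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))


def accept(rng, log_ratio):
    """Return True with probability min(1, exp(log_ratio))."""
    return rng.random() < math.exp(min(0.0, log_ratio))


def mh_within_gibbs(n_steps, step=1.0, seed=0):
    """Alternate one Metropolis-Hastings update of X (Y fixed) and one of Y (X fixed)."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    chain = [(x, y)]
    for _ in range(n_steps):
        z = x + step * rng.gauss(0.0, 1.0)  # Z ~ q(. | X_n)
        if accept(rng, log_f(z, y) - log_f(x, y)):
            x = z                           # X_{n+1}
        z = y + step * rng.gauss(0.0, 1.0)  # Z ~ q~(. | Y_n)
        if accept(rng, log_f(x, z) - log_f(x, y)):
            y = z                           # Y_{n+1}, computed with the new X_{n+1}
        chain.append((x, y))
    return chain


if __name__ == "__main__":
    chain = mh_within_gibbs(30000)[1000:]   # drop burn-in
    n = len(chain)
    print("mean of X:", sum(p[0] for p in chain) / n)
    print("mean of Y:", sum(p[1] for p in chain) / n)
```

Working with the logarithm of <math>\displaystyle f</math> is a design choice that avoids dividing two very small numbers when evaluating the ratio.<br />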
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
PageRank is a link analysis algorithm named after Larry Page, a computer scientist and one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However, the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i \geq 1-d</math>, so for <math>\displaystyle d<1</math> the rank <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math> (which is just a coefficient between 0 and 1 used to balance the two terms).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, only its overall scale needs to be fixed. Summing the rank formula over all pages gives <math>\sum_{i} P_i = (1-d)N + d\sum_{j} P_j</math> (each column of <math>\displaystyle L \times D^{-1}</math> sums to 1 when every page has at least one outgoing link), so for <math>\displaystyle d<1</math> the ranks satisfy <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). Using this we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
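<br />
The iteration can be sketched in a few lines of Python. This is a sketch under stated assumptions: ranks are normalized so that they sum to N (average rank 1), which is the scale under which <math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P</math> is self-consistent; every page is assumed to have at least one outgoing link; and the function name, the value <math>\displaystyle d = 0.85</math> and the three-page example are made up for illustration.<br />

```python
def pagerank(L, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate P <- (1 - d) e + d L D^{-1} P until the ranks stop changing.

    L[i][j] = 1 if page j links to page i, else 0.
    """
    N = len(L)
    # c[j] = number of outgoing links from page j (column sums of L)
    c = [sum(L[i][j] for i in range(N)) for j in range(N)]
    if any(cj == 0 for cj in c):
        raise ValueError("every page needs at least one outgoing link")
    P = [1.0] * N  # initial guess with e^T P = N
    for _ in range(max_iter):
        new_P = [(1.0 - d) + d * sum(L[i][j] * P[j] / c[j] for j in range(N))
                 for i in range(N)]
        if max(abs(a - b) for a, b in zip(new_P, P)) < tol:
            return new_P
        P = new_P
    return P


if __name__ == "__main__":
    # page 0 links to pages 1 and 2; page 1 links to page 2; page 2 links to page 0
    L = [[0, 0, 1],
         [1, 0, 0],
         [1, 1, 0]]
    print(pagerank(L))
```

In this small example page 2 receives links from both other pages and ends up with the highest rank, while page 1, which is linked only from a page that splits its vote, ends up lowest.<br />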
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
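<br />
These definitions translate directly into code. The following Python sketch uses plain lists as vectors; the function names and the example vectors are ours.<br />

```python
import math


def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Length of v: the square root of v . v."""
    return math.sqrt(dot(v, v))


def dist(u, v):
    """Distance between u and v: the length of u - v."""
    return norm([a - b for a, b in zip(u, v)])


def angle(u, v):
    """Angle between two non-zero vectors, from u . v = |u| |v| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding error


if __name__ == "__main__":
    u, v = [3.0, 4.0], [4.0, -3.0]
    print(dot(u, v), norm(u), dist(u, v), angle(u, v))
```

Here <math>\vec{u} \cdot \vec{v} = 3 \cdot 4 + 4 \cdot (-3) = 0</math>, so the angle between the two example vectors is a right angle.<br />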
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that an alternate definition for the trace is:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues <math>\displaystyle \lambda_1, \dots, \lambda_n</math> of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is called a <b>singular matrix</b>.<br />
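The relations between trace, eigenvalues, rank and determinant can be checked numerically. This is a small Matlab sketch on two made-up 2 by 2 matrices, using only the built-in functions <code>trace</code>, <code>eig</code>, <code>rank</code> and <code>det</code>.<br />

```matlab
A = [2 1; 1 3];                 % illustrative symmetric matrix
tr_diag = trace(A);             % sum of the diagonal entries: 2 + 3
tr_eig  = sum(eig(A));          % sum of the eigenvalues, equal to the trace up to rounding
rA = rank(A);                   % full rank, so A is non-singular
dA = det(A);                    % 2*3 - 1*1, which is non-zero

B = [1 2; 2 4];                 % second row is twice the first
rB = rank(B);                   % rank-deficient, so B is singular
dB = det(B);                    % zero up to rounding
```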
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math>, which satisfies <math>A^{\dagger}A = I</math>.<br />
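As an illustration, the formula can be tried on a tall matrix with linearly independent columns. The matrix below is a made-up example, and the comparison with Matlab's built-in <code>pinv</code> is only a sanity check under that full-column-rank assumption.<br />

```matlab
A = [1 0; 0 1; 1 1];            % 3x2 with linearly independent columns (illustrative)
A_dag = (A' * A) \ A';          % (A^T A)^{-1} A^T, computed without forming the inverse
left  = A_dag * A;              % 2x2 identity up to rounding
right = A * A_dag;              % 3x3 projection onto the column space, not the identity
diff_pinv = norm(A_dag - pinv(A));   % agreement with the built-in pseudo-inverse
```

Note that only <math>A^{\dagger}A = I</math> holds here; the product in the other order is a projection matrix.<br />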
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given a matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular, all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite, all eigenvalues are positive.<br />
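These properties can be observed on a small example. The sketch below uses a made-up real, symmetric, positive-definite matrix and the built-in <code>eig</code>, <code>poly</code> and <code>roots</code> functions; it is an illustration, not a proof.<br />

```matlab
A = [2 1; 1 2];                 % real, symmetric, positive-definite (illustrative)
[V, D] = eig(A);                % columns of V are eigenvectors, diag(D) the eigenvalues
lambda = diag(D);
residual = norm(A*V - V*D);     % A v = lambda v holds for every column of V
orth_err = norm(V'*V - eye(2)); % the eigenvectors are orthonormal
char_roots = roots(poly(A));    % roots of the characteristic polynomial |A - lambda I|
all_positive = all(lambda > 0); % positive-definite, so every eigenvalue is positive
```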
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y in <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
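A plane rotation is the simplest example of an orthonormal transformation. The Matlab sketch below (the angle and the vector are arbitrary illustrative choices) checks that <math>AA^T=I</math> and that the length of a vector is unchanged.<br />

```matlab
theta = pi/6;                                   % illustrative rotation angle
A = [cos(theta) -sin(theta); sin(theta) cos(theta)];
x = [3; 4];
y = A * x;                                      % rotated vector
ortho_err = norm(A*A' - eye(2));                % A A^T = I up to rounding
inv_err   = norm(A' - inv(A));                  % A^T equals A^{-1}
len_x = norm(x);                                % length before the transformation
len_y = norm(y);                                % same length after the transformation
```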
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances along the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
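The decorrelation property can be seen on a small sample. In this sketch the data matrix is made up, the sample covariance from <code>cov</code> stands in for <math>\Sigma</math>, and the data are projected onto its eigenvectors.<br />

```matlab
X = [1 2; 2 1; 3 4; 4 3; 5 6; 6 5];         % 6 samples (rows), 2 correlated features
Xc = bsxfun(@minus, X, mean(X, 1));         % centre each feature
S = cov(X);                                 % sample covariance matrix
[V, D] = eig(S);                            % eigenvectors = principal directions
Y = Xc * V;                                 % coordinates in the eigenvector basis
C = cov(Y);                                 % covariance of the transformed data
off_diag = abs(C(1, 2));                    % zero up to rounding: uncorrelated
var_err  = norm(diag(C) - diag(D));         % variances equal the eigenvalues
```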
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second Principal Component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the sample variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle w </math>. The first principal component is the direction that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
| \textbf{w}^T s \textbf{w} | \leq \| \textbf{w} \| \, \| s \textbf{w} \| \leq \| s \| \, \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded on the unit sphere, and since the objective is continuous and the unit sphere is closed and bounded, the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one gives the larger value: <math>\displaystyle f(\sqrt{2}/2,-\sqrt{2}/2) = \sqrt{2}</math> and <math>\displaystyle f(-\sqrt{2}/2,\sqrt{2}/2) = -\sqrt{2}</math>. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
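The result of this example can be cross-checked by brute force, without Lagrange multipliers: parametrize the unit circle by an angle and search for the largest value of <math>\displaystyle f</math>. This Matlab sketch is only a numerical sanity check; the grid size is an arbitrary choice.<br />

```matlab
theta = linspace(0, 2*pi, 100001);      % fine grid of angles on the unit circle
x = cos(theta);
y = sin(theta);                         % every (x, y) satisfies x^2 + y^2 = 1
f = x - y;                              % objective evaluated on the constraint
[fmax, k] = max(f);                     % largest value and where it occurs
x_star = x(k);                          % close to  sqrt(2)/2
y_star = y(k);                          % close to -sqrt(2)/2
```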
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>, then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>. <br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below; here '''S''' is the sample covariance matrix, written <math>s</math> above. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D-dimensional data will have D eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
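The eigenvalue characterization and the variance decomposition can be verified on a toy sample. The sketch below assumes a made-up data matrix with samples in rows and uses the built-in <code>cov</code>, <code>eig</code> and <code>var</code> functions.<br />

```matlab
X = [2 0; 0 1; -2 0; 0 -1; 1 1; -1 -1];     % rows are samples, illustrative data
S = cov(X);                                 % sample covariance matrix (D x D)
[W, L] = eig(S);
[lambda, order] = sort(diag(L), 'descend'); % lambda_1 >= lambda_2 >= ...
W = W(:, order);                            % eigenvectors in the same order
w1 = W(:, 1);                               % first principal component direction
var_u1 = var(X * w1);                       % variance of the projection onto w1
total_var = sum(var(X));                    % sum of the variances of the x_i
```

Here <code>var_u1</code> agrees with <code>lambda(1)</code>, and both <code>sum(lambda)</code> and <code>total_var</code> agree with <code>trace(S)</code>, up to rounding.<br />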
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat') % adjust the path to your copy of noisy.mat<br />
 who                                   % list the loaded variables<br />
 size(X)                               % each column of X is one 20-by-28 image<br />
 idx = 1;                              % index of the image to display (can be changed, e.g. 105, 500, 1000)<br />
 imagesc(reshape(X(:,idx),20,28)')     % noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,idx),20,28)')  % reconstruction of the same image<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. the training data carry labels, and we predict the label of new data) or cluster the data (i.e. the data are not labeled, and we look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to work with a reduced set of features extracted from the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
The Matlab code is as follows.<br />
  load 2_3 %the size of this file is 64 X 400<br />
  [coefs , scores ] = princomp (X'); % in recent Matlab releases, pca(X') plays this role<br />
  % performs principal components analysis on the data matrix X<br />
  % returns the principal component coefficients and scores<br />
  % scores is the low dimensional representation of the data X<br />
  plot(scores(:,1),scores(:,2)) <br />
  % plots the first most variant dimension on the x-axis <br />
  % and the second highest on the y-axis <br />
 [[File:Plot1.jpg]]<br />
  plot(scores(1:200,1),scores(1:200,2))<br />
  % same graph as above, only with the 2s (not 3s)<br />
 [[File:Plot2.jpg]]<br />
  hold on % this command allows us to add to the current plot<br />
  plot (scores(201:400,1),scores(201:400,2),'ro')<br />
  % this adds the data for the 3s<br />
  % the 'ro' option makes them red Os on the plot<br />
 [[File:Plot3.jpg]]<br />
  % If we classify based on the position in this plot (feature), <br />
  % it's easier than looking at each of the 64 data pieces<br />
  gname() % lets us click on points in the plot to label them with their index<br />
  figure<br />
  subplot(1,2,1)<br />
  imagesc(reshape(X(:,45),8,8)')<br />
  subplot(1,2,2)<br />
  imagesc(reshape(X(:,93),8,8)')<br />
 [[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y = U^T X_{D \times n} </math>, where <math>U</math> is the <math>D \times d</math> matrix of the top d principal directions. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
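The encoding and reconstruction steps in these notes can be sketched in a few lines of Matlab. This is a minimal illustration under stated assumptions, not the code from the slide: the data matrix is made up (its third feature is the sum of the first two, so the data lie in a 2-dimensional subspace), and the principal directions are obtained from the SVD of the centred data.<br />

```matlab
X = [ 2 0 -2  0 1 -1;
      0 1  0 -1 1 -1;
      2 1 -2 -1 2 -2];                  % D = 3 features (rows), n = 6 samples (columns)
Xc = bsxfun(@minus, X, mean(X, 2));     % subtract the mean of each feature
[U, ~, ~] = svd(Xc);                    % columns of U are the principal directions
d = 2;
Ud = U(:, 1:d);                         % keep the top d directions (D x d)
Y = Ud' * Xc;                           % encoding: d x n
Xhat = Ud * Y;                          % reconstruction: D x n
err2 = norm(Xc - Xhat, 'fro');          % essentially zero, the data are 2-dimensional
Xhat1 = U(:, 1) * (U(:, 1)' * Xc);      % reconstruction from one component only
err1 = norm(Xc - Xhat1, 'fro');         % clearly larger: information is lost
```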
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to, depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class are close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2 </math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within classes covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar, and a scalar equals its own trace. Therefore,<br />
<br /><br /><br />
<math>\displaystyle (w^T \mu_1 - w^T \mu_2)^2 = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two class problem the between class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a well-known problem called "the generalized eigenvalue problem". We can solve it using Lagrange multipliers. Since w is a direction vector, we do not care about its magnitude. Therefore we solve a problem similar to that in PCA,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br /><br />
subject to <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br /><br /><br />
We solve the following Lagrange multiplier problem,<br />
<br /><br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br /><br />
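Setting the gradient of this Lagrangian with respect to <math>w</math> to zero shows where the eigenvector problem comes from. The following is a short sketch, assuming that <math>s_w</math> is invertible and using the symmetry of <math>s_B</math> and <math>s_w</math>:<br />
<br />
<math>\begin{align}
\frac{\partial L}{\partial w} &= 2s_Bw - 2\lambda s_ww = 0 \\
s_Bw &= \lambda s_ww \\
s_w^{-1}s_Bw &= \lambda w
\end{align}</math><br />
<br />
At such a point <math>w^Ts_Bw = \lambda w^Ts_ww = \lambda</math>, so the objective is largest for the eigenvector of <math>s_w^{-1}s_B</math> with the largest eigenvalue.<br />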
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, our optimization problem for FDA is:<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
Using Lagrange multipliers, we look for the stationary points of the Lagrangian: <br />
<math>\displaystyle (w^Ts_Bw) - \lambda [(w^Ts_ww)-1] </math><br />
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified, such that:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
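The two-class simplification can be checked numerically. The sketch below plugs illustrative population means and a common covariance matrix directly into the formulas (no sampling) and compares the closed-form direction with the leading eigenvector of <math>s_w^{-1}s_B</math>; the variable names are choices made for this illustration.<br />

```matlab
mu1 = [1; 1];  mu2 = [5; 3];                % illustrative class means
Sigma = [1 1.5; 1.5 3];                     % common class covariance
s_w = Sigma + Sigma;                        % within-class covariance
s_B = (mu1 - mu2) * (mu1 - mu2)';           % between-class covariance
w = s_w \ (mu2 - mu1);                      % closed form: s_w^{-1} (mu2 - mu1)
[V, D] = eig(s_w \ s_B);                    % eigenvectors of s_w^{-1} s_B
[lambda_max, k] = max(real(diag(D)));
v = real(V(:, k));                          % eigenvector of the largest eigenvalue
cosine = abs(w' * v) / (norm(w) * norm(v)); % 1 when w and v are parallel
J = @(u) (u' * s_B * u) / (u' * s_w * u);   % Fisher ratio for a direction u
J_w  = J(w);                                % value at the FDA direction
J_e1 = J([1; 0]);                           % value at an arbitrary other direction
```

The ratio at <code>w</code> equals the largest eigenvalue and exceeds the ratio at the other direction, as the derivation predicts.<br />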
<br />
=====Example:=====<br />
<br />
We now see an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two bivariate normal distributions to show the difference between the two methods. <br />
 %First of all, we generate the two data sets:<br />
 X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300); <br />
 %In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.<br />
 X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300); <br />
 %Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
 %The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
 %We compute the principal components:<br />
 X=[X1;X2];  % stack the two samples: 600-by-2<br />
 X=X';       % 2-by-600, one column per data point<br />
 [coefs, scores]=princomp(X');<br />
 coefs(:,1) %first principal component<br />
<br />
ans =<br />
0.76355476446932<br />
0.64574307712603<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=inv(sw)*[4 2]' % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
 plot(Xf(1:300),1,'ob') %Since the projected data are one-dimensional, we plot them along a horizontal line at height 1<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is no overlap between the two classes<br />
 Xp=coefs(:,1)'*X<br />
 figure<br />
 plot(Xp(1:300),1,'ob')<br />
 hold on<br />
 plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is an overlap since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we might wish to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (In the case in which Y is a continuous variable, the prediction problem is called regression rather than classification.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
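'''Error Rate'''<br />
<br />
The true error rate of a classifier <math>h</math> is the probability that it misclassifies a new point:<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given <math>N</math> test points, how can we estimate the error rate? We simply count the number of points that have been misclassified and divide by the total number of points, which gives the empirical error rate (where <math>I</math> is the indicator function):<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />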
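<br />
The counting estimate above can be sketched in a few lines of Python. This is only an illustration: the function names, the threshold classifier, and the five test points are all hypothetical and are not part of the lecture.<br />

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) != y_i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


# Hypothetical classifier: predict class 1 when x >= 0.5, else class 0.
def threshold_classifier(x):
    return 1 if x >= 0.5 else 0


# Made-up test set; the last point (0.45, label 1) is misclassified.
test_x = [0.1, 0.4, 0.6, 0.9, 0.45]
test_y = [0, 0, 1, 1, 1]
print(empirical_error_rate(threshold_classifier, test_x, test_y))  # 1 mistake out of 5 points
```

Any callable that maps a point to a label can be passed as <code>h</code>.<br />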
<br />
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math>, and define<br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classifier function<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
This function is considered the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions to calculate <math>\, r(x)</math>. This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.<br />
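<br />
When the class densities and priors ''are'' known, the rule can be evaluated directly. The sketch below assumes, purely for illustration, two univariate Gaussian classes with a common standard deviation; the means, the priors, and all function names are hypothetical.<br />

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_class1(x, prior1, mu0, mu1, sigma):
    """r(x) = Pr(Y=1 | X=x) computed with Bayes' theorem."""
    num = normal_pdf(x, mu1, sigma) * prior1
    den = num + normal_pdf(x, mu0, sigma) * (1.0 - prior1)
    return num / den


def bayes_classifier(x, prior1=0.5, mu0=0.0, mu1=2.0, sigma=1.0):
    """h(x) = 1 if r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior_class1(x, prior1, mu0, mu1, sigma) >= 0.5 else 0
```

With equal priors and equal variances, the two posteriors are equal halfway between the two means (here at x = 1), which is where the predicted label changes.<br />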
<br />
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, the boundary consists of the points at which the posterior probabilities of the two classes are equal. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a linear decision boundary while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
We would like to apply the Bayes classification rule by estimating the class conditional density <math>f_k(x)</math> and the prior <math>\pi_k</math>, where<br /><br /><br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l}f_l(x)\pi_l}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional densities are multivariate Gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>(x + a)^TA(x+b) = x^TAx + x^TAb + a^TAx + a^TAb</math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\Sigma_k</math> is the class covariance matrix and <math>\mu_k</math> is the class mean. By definition of the decision boundary,<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is linear in <math>x</math>: the boundary has the form <math>a^Tx + b = 0</math>.<br />
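<br />
As a rough numerical sketch of this linear boundary, the code below computes the coefficients for two classes in two dimensions, taking <math>a = \Sigma^{-1}(\mu_k - \mu_l)</math> and <math>b = -\frac{1}{2}(\mu_k - \mu_l)^T\Sigma^{-1}(\mu_k + \mu_l) + \log\frac{\pi_k}{\pi_l}</math>. The helper names and the restriction to 2x2 covariance matrices are assumptions made to keep the example self-contained.<br />

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def lda_boundary(mu_k, mu_l, sigma, pi_k, pi_l):
    """Return (a, b) such that the decision boundary is a^T x + b = 0.

    Points with a^T x + b > 0 fall on the class-k side.
    """
    sigma_inv = inv2(sigma)
    diff = [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]]
    total = [mu_k[0] + mu_l[0], mu_k[1] + mu_l[1]]
    a = matvec(sigma_inv, diff)
    b = -0.5 * dot(diff, matvec(sigma_inv, total)) + math.log(pi_k / pi_l)
    return a, b
```

With equal priors the log term vanishes, so the midpoint between the two class means lies on the boundary; this gives a quick sanity check on the coefficients.<br />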
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|) </math><br />
<br />
<br />
<br />
To classify a point, x, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class k if <math>\, \delta_k > \delta_l </math> and to class l otherwise. <br /><br /><br />
<br />
:<math><br />
h(x) = \begin{cases}<br />
k, & \text{if } \delta_k > \delta_l \\<br />
l, & otherwise\\<br />
\end{cases}</math><br />
<br />
(note: since <math> - \frac{d}{2}log(2\pi) </math> is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.) <br /><br /><br />
<br />
Consider a '''special case''': <math>\, \Sigma_k = I </math>, the identity matrix. Then,<br />
<br />
<math> \,\delta_k = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) - \frac{1}{2}log(|I|) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) </math><br />
<br />
We see that in the case <math> \Sigma = I </math>, we can simply classify a point, x, to a class based on the distances between x and the mean of the different classes (adjusted with the log of the prior). <br /><br /><br />
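<br />
A minimal sketch of this special case, assuming identity covariance for every class: each class is scored by its log prior minus half the squared distance to its mean, and the highest score wins. The function names and the example means and priors used to exercise it are hypothetical.<br />

```python
import math


def delta_identity(x, mu, prior):
    """delta_k = log(pi_k) - 0.5 * ||x - mu_k||^2, valid when Sigma = I."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.log(prior) - 0.5 * sq_dist


def classify(x, means, priors):
    """Return the index k of the class with the largest delta_k."""
    scores = [delta_identity(x, mu, p) for mu, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)
```

For instance, with means (0, 0) and (2, 0), the point (0.9, 0) is assigned to the first class under equal priors, but to the second class when the priors are 0.1 and 0.9, because the log prior shifts the comparison.<br />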
<br />
In the case that <math>\, \Sigma_k \ne I </math>, we do, <br /><br /><br />
<br />
<math> \, \Sigma_k = USV^T = USU^T </math> (since <math> \Sigma </math> is symmetric)<br /><br /><br />
<math> \, \Sigma_k^{-1} = (USU^T)^{-1} = U^{-1}S^{-1}(U^T)^{-1} = US^{-1}U^T </math><br /><br /><br />
<br />
So, <math> (x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^T US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-1}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k)^T I(S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k) </math><br />
:<math> \, = (x^* - \mu_k^*)^TI(x^* - \mu_k^*) </math> <br /><br /><br />
<br />
where <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> and <math> \mu_k^* = S^{-\frac{1}{2}}U^T\mu_k </math><br />
<br />
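Hence the approach taken should be to transform the point x (and the class means) at the outset,<br />
<br />
i.e. Let <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> and <math> \, \mu_k^* = S^{-\frac{1}{2}}U^T\mu_k </math>. <br /><br /><br />
<br />
Then compute <math> \, \delta_k </math> and <math> \, \delta_l </math> using <math>x^*</math> and the transformed means, exactly as in the identity-covariance case. <br /><br /><br />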
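<br />
The transformation can be checked numerically. The sketch below is restricted to symmetric 2x2 covariance matrices so that the decomposition <math>\Sigma = USU^T</math> can be written in closed form without a linear algebra library; the function names are hypothetical, and a general implementation would call an SVD or eigenvalue routine instead.<br />

```python
import math


def eig_sym2(m):
    """Eigen-decomposition of a symmetric 2x2 matrix [[a, b], [b, c]].

    Returns (s, u): eigenvalues s = [s1, s2] and orthonormal eigenvectors
    stored as the columns of u, so that m = U diag(s) U^T.
    """
    a, b, c = m[0][0], m[0][1], m[1][1]
    if b == 0:
        return [a, c], [[1.0, 0.0], [0.0, 1.0]]
    mean, half_gap = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    s = [mean + half_gap, mean - half_gap]
    cols = []
    for lam in s:
        vx, vy = b, lam - a          # solves (a - lam) * vx + b * vy = 0
        norm = math.hypot(vx, vy)
        cols.append((vx / norm, vy / norm))
    u = [[cols[0][0], cols[1][0]], [cols[0][1], cols[1][1]]]
    return s, u


def whiten(x, s, u):
    """Return x* = S^{-1/2} U^T x."""
    ut_x = [u[0][j] * x[0] + u[1][j] * x[1] for j in range(2)]
    return [ut_x[j] / math.sqrt(s[j]) for j in range(2)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
```

After whitening both the point and a class mean, <code>sq_dist</code> between them equals the quadratic form <math>(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)</math>, so the nearest-mean computation of the identity case can be reused.<br />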
<br />
If it is the case that the priors of the 2 classes are the same, then this method only requires us to find the distances from point x to the mean of the 2 classes.</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3442stat341 / CM 3612009-07-24T03:06:35Z<p>Hclam: /* Computational Method */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. A computer is deterministic, so instead of truly random numbers we use deterministic algorithms whose output is designed to look like independent draws from a given distribution (pseudo-random numbers). Outside a computational setting, producing uniformly distributed values is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
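<br />
The recurrence is easy to reproduce in code. The following sketch (the function name <code>lcg</code> is an assumption, not something from the lecture) generates the first few terms for any choice of ''a'', ''b'', ''m'' and seed.<br />

```python
def lcg(a, b, m, seed, n):
    """Return the first n values x_1, ..., x_n of x_{i+1} = (a*x_i + b) mod m."""
    values = []
    x = seed
    for _ in range(n):
        x = (a * x + b) % m
        values.append(x)
    return values


print(lcg(13, 0, 31, 1, 3))   # the worked example above: [13, 14, 27]
```

Trying different parameters with the same function is a quick way to see how short the cycle can become for a poor choice.<br />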
<br />
The above generator of pseudorandom numbers, in its general form with a nonzero additive term ''b'', is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term; when ''b'' = 0, as in the example, it is purely multiplicative. For suitably chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence that visits every integer between 0 and ''m'' - 1 (or every integer between 1 and ''m'' - 1 when ''b'' = 0) before repeating. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for any ''n'', and the period to be equal to ''m'' - 1. In practice, Park and Miller found in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the largest value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is passed through the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, strictly increasing, and is the cumulative distribution function (cdf) of some distribution, then the random variable X has cdf F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta t} dt = 1 - e^{-\theta x} </math><br />
<br />
:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as Unif[0, 1] whenever <math>U</math> is, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
<br />
<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math> and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math>. Further, for some distributions it is not even possible to find <math>F(x)</math> (i.e. a closed form expression for the distribution function, or otherwise; even if the closed form expression exists, it's usually difficult to find <math>F^{-1}</math>).<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (less-than-or-equal-to signs are used for the upper limits so that the case <math>U = 1</math> is not missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
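<br />
The two steps can be sketched as a single loop over the cumulative probabilities. This is an illustrative version only: the function name and the optional argument <code>u</code> (which lets a caller supply the uniform draw) are assumptions. It returns the first value whose cumulative probability is at least ''u''; how ties at the interval endpoints are broken affects the result only with probability zero.<br />

```python
import random


def sample_discrete(values, probs, u=None, rng=random):
    """Return values[k] for the first k with u <= p_0 + ... + p_k."""
    if u is None:
        u = rng.random()           # Step 1: draw U ~ Unif[0, 1]
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p            # Step 2: walk through the cumulative sums
        if u <= cumulative:
            return value
    return values[-1]              # guards against round-off when u is very close to 1
```

Calling <code>sample_discrete(values, probs)</code> repeatedly yields draws whose relative frequencies approach <code>probs</code>.<br />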
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
for i=1:1000;<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
   elseif u <= sum(p(1:2))<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant c such that,<br /><br /><br />
<math> c \cdot g(x) \geq f(x)\ \forall x</math><br />
<br /><br /><br />
then we can draw a candidate <math>Y</math> from <math>g</math> and accept it with probability<br />
<br /><br /><br />
<math> \frac {f(Y)}{c \cdot g(Y)} </math>,<br />
<br /><br /><br />
and the accepted candidates follow the target distribution <math>f(x)</math>. Candidates are likely to be accepted where this ratio is close to 1 and likely to be rejected where it is close to 0.<br />
<br />
<br />
<br />
At x=7, a candidate drawn from <math>g(x)</math> will be accepted as a sample from the target distribution <math>f(x)</math>.<br />
<br />
At x=9, candidates drawn from <math>g(x)</math> will be rejected according to the ratio <math> \frac {f(x)}{c \cdot g(x)} </math>.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
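<br />
A generic version of this procedure might look like the sketch below, which also counts proposals so that the acceptance rate can be compared with the value <math>1/c</math> derived in the proof. The function signature is an assumption for illustration; the demonstration uses the target <math>f(x) = 2x</math> on [0, 1] with a Unif[0, 1] proposal and <math>c = 2</math>.<br />

```python
import random


def accept_reject(f, g_pdf, g_sample, c, n, rng=random):
    """Draw n samples from density f; return (samples, number of proposals)."""
    samples, proposals = [], 0
    while len(samples) < n:
        y = g_sample(rng)                  # step 1: Y ~ g
        u = rng.random()                   # step 2: U ~ Unif[0, 1]
        proposals += 1
        if u <= f(y) / (c * g_pdf(y)):     # step 3: accept, otherwise try again
            samples.append(y)
    return samples, proposals


# Target f(x) = 2x on [0, 1], proposal Unif[0, 1], c = 2.
rng = random.Random(0)
xs, tried = accept_reject(lambda x: 2.0 * x, lambda x: 1.0,
                          lambda r: r.random(), 2.0, 20000, rng)
print(len(xs) / tried)        # acceptance rate, should be near 1/c = 0.5
print(sum(xs) / len(xs))      # sample mean, should be near E[X] = 2/3
```

On average <math>c</math> proposals are needed per accepted sample, which is why a small <math>c</math> is preferable.<br />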
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. Taking <math>c</math> to be this maximum gives the smallest constant that satisfies <math> c \cdot g(x) \geq f(x) </math> for all <math>x</math>, and hence the highest acceptance probability, <math>1/c</math>.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
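 x = zeros(1,1000);   % preallocate the accepted samples<br />
 j = 1;<br />
 while j <= 1000<br />
   y = rand;          % candidate from g = Unif[0,1]<br />
   u = rand;<br />
   if u <= y          % accept with probability f(y)/(c*g(y)) = y<br />
       x(j) = y;<br />
       j = j + 1;<br />
   end<br />
 end<br />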
hist(x);<br />
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
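<br />
The three steps of this discrete example can be sketched as follows. The constant and function names are hypothetical; the probabilities are the ones given in the example.<br />

```python
import random

PMF = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}   # target pmf f
G = 0.25                                      # proposal g: uniform on {1, 2, 3, 4}
C = max(PMF.values()) / G                     # c = 0.55 / 0.25 = 2.2


def sample_pmf(rng=random):
    """Return one draw from PMF by acceptance/rejection."""
    while True:
        y = rng.choice([1, 2, 3, 4])          # step 1: Y ~ Unif{1, 2, 3, 4}
        u = rng.random()                      # step 2: U ~ Unif[0, 1]
        if u <= PMF[y] / (C * G):             # step 3: accept with probability f(Y) / (c g(Y))
            return y
```

Because <math>c = 2.2</math>, roughly 2.2 candidates are proposed for every accepted value.<br />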
<br />
'''Limitations'''<br />
<br />
If the proposed distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find such <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
This suggests that if we want to sample from the Gamma distribution, we can draw t independent exponential random variables with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far; e.g. the inverse of F(Z) has no closed form. We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is its distance from the origin (the pole), and θ is the angle between the polar axis and the line from the origin to the point, then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, an Exponential density in d (with rate 1/2, i.e. mean 2) and a Uniform density in θ, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>u=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[-\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2\right] </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\mu</math> is Gaussian with mean <math>\mu_0</math> and standard deviation <math>\tau</math> (with <math>\sigma</math> known):<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} \cdot \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
where <math>\frac{1}{\tilde{\sigma}^2} = \frac{N}{\sigma^2}+\frac{1}{\tau^2}</math> and<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
The expectation from the posterior is different from the MLE method.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
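The two limits of <math>\tilde{\mu}</math> can be checked numerically. The following Python sketch simply evaluates the formula for <math>\tilde{\mu}</math>; the function name <code>posterior_mean</code> and the numbers passed to it are chosen only for illustration.

```python
def posterior_mean(xbar, n, sigma, mu0, tau):
    """Posterior mean of mu for N(mu, sigma^2) data with a N(mu0, tau^2) prior."""
    data_precision = n / sigma ** 2
    prior_precision = 1.0 / tau ** 2
    weight = data_precision / (data_precision + prior_precision)
    return weight * xbar + (1.0 - weight) * mu0

# As n grows the posterior mean moves from the prior mean mu0 towards xbar.
for n in (0, 1, 10, 1000):
    print(n, posterior_mean(xbar=2.0, n=n, sigma=1.0, mu0=0.0, tau=1.0))
```

The posterior mean is a weighted average of the sample mean and the prior mean, with weights proportional to the precisions <math>N/\sigma^2</math> and <math>1/\tau^2</math>.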
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to the simulation of a random process whose future evolution is described by probability distributions (the counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int_a^b w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
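The process and the two statistics above can be sketched as follows. This is a minimal Python illustration, not the course's code; the function name <code>mc_integrate</code> is made up here, and 1.96 is used as the approximate <math>Z_{\alpha/2}</math> for a 95% interval.

```python
import math
import random

def mc_integrate(h, a, b, n, seed=0):
    """Estimate the integral of h over [a, b]; returns (estimate, standard error)."""
    rng = random.Random(seed)
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]  # Y_i = w(x_i)
    i_hat = sum(ys) / n
    s2 = sum((y - i_hat) ** 2 for y in ys) / (n - 1)
    return i_hat, math.sqrt(s2 / n)

i_hat, se = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 10000)
z = 1.96  # approximate 97.5% standard normal quantile, for a 95% interval
print(i_hat, (i_hat - z * se, i_hat + z * se))  # exact value is 1/4
```

The standard error shrinks like <math>1/\sqrt{N}</math>, so halving the width of the interval requires roughly four times as many samples.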
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
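# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, the Exp(1) density on <math>x \geq 0</math>, for example by the inverse transform <math>x = -\log(u)</math> with <math>u \sim~ Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />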
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
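u=rand(100,1);        % u ~ Unif(0,1)<br />
x=-log(u);            % inverse transform: x ~ Exp(1)<br />
h=x.^.5;              % h(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value of the integral by numerical quadrature (the exact value is sqrt(pi)/2, about 0.8862)<br />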
<br />
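For readers without Matlab, Example 1 can also be sketched in Python with the standard library. The sample size and seed below are arbitrary choices; the exact value <math>E[\sqrt{X}] = \Gamma(3/2) = \sqrt{\pi}/2</math> is used only as a check.

```python
import math
import random

rng = random.Random(1)
n = 100000
# Inverse transform: if U ~ Unif(0,1) then -log(U) ~ Exp(1).
# 1 - rng.random() lies in (0, 1], which avoids log(0).
xs = [-math.log(1.0 - rng.random()) for _ in range(n)]
estimate = sum(math.sqrt(x) for x in xs) / n
exact = math.gamma(1.5)  # E[sqrt(X)] = Gamma(3/2) = sqrt(pi)/2
print(estimate, exact)
```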
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# Exact value: <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);<br />
mean(x.^3)<br />
<br />
'''Example 3'''<br />
To estimate an infinite integral<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
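# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dx=-\frac{1}{y^2}dy </math>, so that <math>x = \frac{1}{y}+4</math>, with <math>y=1</math> at <math>x=5</math> and <math>y \rightarrow 0</math> as <math>x \rightarrow \infty</math><br />
# <math> I = \displaystyle\int^0_1 3e^{-(\frac{1}{y}+4)}\left(-\frac{1}{y^2}\right)\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />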
<br />
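Example 3 can be checked numerically, since the exact value of the integral is <math>3e^{-5} \approx 0.0202</math>. The Python sketch below uses the substitution <math>x = \frac{1}{y}+4</math> with a positive integrand on <math>(0,1]</math>; the function name <code>w</code>, the seed and the sample size are choices made for this illustration.

```python
import math
import random

def w(y):
    """Integrand after the substitution y = 1/(x - 4), i.e. x = 1/y + 4."""
    return 3.0 * math.exp(-(1.0 / y + 4.0)) / (y * y)

rng = random.Random(2)
n = 100000
# 1 - rng.random() lies in (0, 1], so y is never exactly zero.
estimate = sum(w(1.0 - rng.random()) for _ in range(n)) / n
exact = 3.0 * math.exp(-5.0)
print(estimate, exact)
```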
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior density of <math>\displaystyle p</math> given <math>\displaystyle x</math> successes in <math>\displaystyle n</math> trials is proportional to the Binomial likelihood <math>\displaystyle p^x(1-p)^{n-x}</math>, which is the kernel of a Beta distribution. Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\% \text{ CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
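The simulation and the analytic check can be reproduced with a short Python sketch using the standard library's Beta sampler. The seed and the number of draws <code>N</code> are arbitrary choices here, so the printed values will differ slightly from the Matlab run reported above.

```python
import random

rng = random.Random(3)
n, m = 100, 100      # trials for X and Y
x, y = 80, 60        # observed successes
N = 20000            # number of posterior draws

deltas = sorted(rng.betavariate(x + 1, n - x + 1) - rng.betavariate(y + 1, m - y + 1)
                for _ in range(N))
mean_delta = sum(deltas) / N
# Empirical 2.5% and 97.5% quantiles of the simulated delta values.
lower, upper = deltas[int(0.025 * N)], deltas[int(0.975 * N)]
exact = (x + 1) / (n + 2) - (y + 1) / (m + 2)
print(mean_delta, (lower, upper), exact)
```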
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
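We assume a flat prior on each <math>\displaystyle p_i</math>, so that its posterior given <math>\displaystyle y_i</math> is a Beta distribution. Then:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math><br />
# Keep sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard and try again.<br />
# Compute <math>\displaystyle \delta</math> by finding the first dose <math>\displaystyle x_i</math> whose sampled <math>\displaystyle p_i</math> is at least 0.5.<br />
# Repeat the process until N samples have been kept, giving N values <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />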
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
<br />
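The Monte Carlo process for the dose-response problem can be sketched in Python as below. The data are hypothetical: five dose levels with made-up death counts are used purely for illustration, and they are not the lecture's data set. The sketch also skips draws in which no <math>p_i</math> reaches 0.5, which is an assumption added here because <math>\delta</math> is undefined in that case.

```python
import random

def estimate_dose(doses, deaths, n_rats, n_samples, seed=0):
    """Monte Carlo estimate of the mean of delta = x_j, j = min{i : p_i >= 0.5}."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n_samples:
        p = [rng.betavariate(y + 1, n_rats - y + 1) for y in deaths]
        if any(p[i] > p[i + 1] for i in range(len(p) - 1)):
            continue  # violates p_1 <= ... <= p_k: discard and redraw
        hits = [x for x, pi in zip(doses, p) if pi >= 0.5]
        if not hits:
            continue  # no dose reaches 50% in this draw: delta undefined, skip
        kept.append(hits[0])
    return sum(kept) / len(kept)

# Hypothetical data for illustration only (not the lecture's data set).
doses = [1.0, 2.0, 3.0, 4.0, 5.0]
deaths = [1, 3, 5, 7, 9]
dose_estimate = estimate_dose(doses, deaths, n_rats=10, n_samples=2000)
print(dose_estimate)
```

With ten dose levels and overlapping posteriors, the ordering constraint rejects most draws, so a real run may need many more proposals than kept samples.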
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As with the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target distribution. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as the expectation with respect to U, so that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x) and therefore <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the Law of Large Numbers, <math>\hat{I}</math> converges in probability to <math>I </math>, provided <math>E_g[|w(x)|]</math> is finite.<br />
<br />
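The three-step process can be sketched generically in Python. The function name <code>importance_sampling</code> and its arguments are illustrative, not part of any library. As a toy check, the target <math>f</math> is taken to be N(0,1), the proposal <math>g</math> is N(0,4), and <math>h(x)=x^2</math>, so the true value is 1.

```python
import math
import random

def normal_pdf(x, mu=0.0, sd=1.0):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def importance_sampling(h, f, g, draw_g, n):
    """Estimate the integral of h(x) f(x) dx using draws x_i from g."""
    total = 0.0
    for _ in range(n):
        x = draw_g()
        total += h(x) * f(x) / g(x)  # w(x_i)
    return total / n

rng = random.Random(4)
est = importance_sampling(
    h=lambda x: x * x,
    f=normal_pdf,                          # target density: N(0, 1)
    g=lambda x: normal_pdf(x, 0.0, 2.0),   # proposal density: N(0, 2^2)
    draw_g=lambda: rng.gauss(0.0, 2.0),
    n=50000,
)
print(est)  # E[X^2] = 1 under N(0, 1)
```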
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to those values sampled from <math>\displaystyle g(x)</math> that are relatively more likely under the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math> and weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that draws from regions where <math>\displaystyle f</math> is large relative to <math>\displaystyle g</math> count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of <math> h(x_i) </math> instead of a plain sum.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math> We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Where <math>\displaystyle f > g</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen <math>\displaystyle g</math>, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x)</math> does not vanish at least as fast, then the integrand blows up and <math>\displaystyle E[(w(x))^2]</math> can be infinite. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, since then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can grow without bound. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
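The effect of the tails can be seen by tabulating the weight <math>\frac{f}{g}</math> directly. In the Python sketch below, <math>f</math> is N(0,1) and two hypothetical proposals are compared: N(0, 0.25), which has thinner tails than <math>f</math>, and N(0, 4), which has thicker tails. The function names are made up for this illustration.

```python
import math

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def weight(x, proposal_sd):
    """f(x) / g(x) for f = N(0, 1) and g = N(0, proposal_sd^2)."""
    return normal_pdf(x, 1.0) / normal_pdf(x, proposal_sd)

# Columns: x, weight under the thin-tailed proposal, weight under the thick-tailed one.
for x in (0.0, 2.0, 4.0, 6.0):
    print(x, weight(x, 0.5), weight(x, 2.0))
```

For the thin-tailed proposal the weight equals <math>0.5e^{1.5x^2}</math>, which is unbounded, whereas for the thick-tailed proposal it equals <math>2e^{-0.375x^2}</math> and never exceeds 2.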
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
======Aside: Jensen's Inequality======<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>; in this aside <math>\displaystyle g</math> denotes an arbitrary convex function, not the proposal density) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math> for <math>\displaystyle 0 \leq \alpha \leq 1</math><br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to any finite convex combination of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math> and <math>\displaystyle \alpha_i \geq 0</math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
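::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... + p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore, taking the convex function <math>\displaystyle g(t) = t^2</math> applied to <math>\displaystyle \left| w(x) \right|</math>,<br />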
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
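Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>, and it does not depend on <math>\displaystyle g</math>. If we substitute <math>\displaystyle g = g^* </math>, then <math>\displaystyle \left| w(x) \right| = \int \left| h(s) \right| f(s) ds</math> is constant, so <math>\displaystyle E_{g^*}[(w(x))^2] = \left(\int \left| h(x) \right| f(x) dx \right)^2</math> and the lower bound is attained, as required.<br /><br />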
<br />
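However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />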
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
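A case where <math>g^*</math> is available in closed form makes the theorem concrete. In the Python sketch below (a toy example chosen for illustration, not from the lecture), <math>f</math> is the Exp(1) density and <math>h(x)=x</math>, so <math>I=1</math> and <math>g^*(x) = xe^{-x}</math> is the Gamma(2,1) density. Every weight <math>w(x_i)</math> then equals <math>I</math> and the estimator has zero variance.

```python
import math
import random

rng = random.Random(5)

def f(x):        # target density: Exp(1)
    return math.exp(-x)

def h(x):        # integrand, so I = E_f[X] = 1
    return x

def g_star(x):   # |h| f / integral(|h| f) = x exp(-x), the Gamma(2, 1) density
    return x * math.exp(-x)

weights = []
for _ in range(1000):
    x = rng.gammavariate(2.0, 1.0)   # draw from g*
    weights.append(h(x) * f(x) / g_star(x))

mean_w = sum(weights) / len(weights)
spread = max(weights) - min(weights)
print(mean_w, spread)  # every weight equals I = 1, up to rounding
```

This only works because the normalizing constant of <math>g^*</math> (here equal to 1) was known in advance, which is exactly the quantity being estimated.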
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since <math>\displaystyle \int f(x) dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
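The normalized-weight form can be sketched in Python as follows. The function name <code>self_normalized_is</code> is made up for this illustration. As a toy check, <math>f</math> is supplied only as <math>e^{-x^2/2}</math>, that is, the N(0,1) density without its constant, with proposal N(0,4) and <math>h(x) = x^2</math>, so the true value is 1.

```python
import math
import random

def self_normalized_is(h, f_unnorm, g, draw_g, n):
    """Estimate E_f[h(X)] when f is known only up to a constant factor."""
    num = 0.0
    den = 0.0
    for _ in range(n):
        x = draw_g()
        b = f_unnorm(x) / g(x)   # unnormalised weight b(x_i)
        num += h(x) * b
        den += b
    return num / den

rng = random.Random(6)
est = self_normalized_is(
    h=lambda x: x * x,
    f_unnorm=lambda x: math.exp(-0.5 * x * x),  # N(0, 1) without 1/sqrt(2*pi)
    g=lambda x: math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi)),
    draw_g=lambda: rng.gauss(0.0, 2.0),
    n=50000,
)
print(est)  # true value is E[X^2] = 1
```

Any constant multiplying <math>f</math> cancels between the numerator and the denominator, which is why the unknown normalizing constant of a posterior is not needed.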
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_i \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to compute the proportion of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo, since this is a low probability event (a tail event) and different runs may give significantly different values. For example, with 1000 samples per run, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (an estimated probability of .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (here each of 100 runs generates 100 samples):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it places most of its mass where <math>h(x)f(x)</math> is not negligible, here the region <math>x \geq 3</math>.<br />
<br />
:Let <math>g(x) \sim~ N(4,1) </math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and much less than the variability observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space Set:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Indexed Set:''' <math>\displaystyle T </math> is the set from which t takes its values; it can be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Indexed Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math>\displaystyle X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle P_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. With the states ordered from left to right, the transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\ddots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
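The random walk transition matrix can be built programmatically. The Python sketch below (the function name <code>random_walk_matrix</code> is made up for this illustration) follows the description in the text: from an interior state the walk moves right with probability <math>p</math> and left with probability <math>q = 1-p</math>, and the first and last states are absorbing.

```python
def random_walk_matrix(n_states, p):
    """Transition matrix of a walk on states 0..n_states-1 with absorbing ends."""
    q = 1.0 - p
    P = [[0.0] * n_states for _ in range(n_states)]
    P[0][0] = 1.0                          # absorbing first state
    P[n_states - 1][n_states - 1] = 1.0    # absorbing last state
    for i in range(1, n_states - 1):
        P[i][i + 1] = p   # step right
        P[i][i - 1] = q   # step left
    return P

P = random_walk_matrix(5, 0.3)
for row in P:
    print(row)
```

Each row of the matrix is a probability mass function over the next state, so every row sums to 1.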
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of values <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a sample average, as in basic Monte Carlo integration, can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1 over all <math>j</math>, since row <math>i</math> must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
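As a quick numerical check of the theorem, the following Python sketch propagates a marginal distribution in two ways: by applying <math>\mu_n=\mu_{n-1}P</math> repeatedly, and by forming <math>P^n</math> first. The 2-state matrix and the starting vector are hypothetical values chosen only for illustration, and the helper names are our own.

```python
def vec_mat(mu, P):
    """Return the row vector mu * P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


def mat_mul(A, B):
    """Return the matrix product A * B for square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """Return P^n by repeated multiplication (n >= 0)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


# Hypothetical 2-state chain and starting distribution (illustration only).
P = [[0.9, 0.1],
     [0.4, 0.6]]
mu0 = [1.0, 0.0]

# Route 1: apply mu_n = mu_{n-1} P five times.
mu_step = mu0
for _ in range(5):
    mu_step = vec_mat(mu_step, P)

# Route 2: form the 5-step transition matrix P^5, then compute mu_0 P^5.
mu_power = vec_mat(mu0, mat_pow(P, 5))

print(mu_step)
print(mu_power)
```

Both routes should print the same vector up to rounding, and its entries should still sum to 1.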
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N\times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N\times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A set is '''Irreducible''' if all states communicate with each other (has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only move to each other or stay where they are (rows 1 and 2 are zero outside columns 1 and 2), so they communicate and form a closed class<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class with only one element<br />
::<br />
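The decomposition above can be reproduced mechanically. The Python sketch below finds the communicating classes of the example matrix by computing which states are reachable from which; the function names are our own, and each state is taken to communicate with itself by convention.

```python
def reachable(P, start):
    """Set of states reachable from `start` in zero or more steps."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, p in enumerate(P[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def communicating_classes(P):
    """Return a list of (class, is_closed) pairs, with 0-indexed states."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes = []
    assigned = set()
    for i in range(n):
        if i in assigned:
            continue
        # j is in the class of i when i reaches j and j reaches i.
        cls = sorted(j for j in range(n) if j in reach[i] and i in reach[j])
        assigned.update(cls)
        # A class is closed when nothing outside it can be reached from it.
        closed = all(reach[j] <= set(cls) for j in cls)
        classes.append((cls, closed))
    return classes


P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]

# The code is 0-indexed, so add 1 to match the state labels in the text.
for cls, closed in communicating_classes(P):
    print([s + 1 for s in cls], "closed" if closed else "not closed")
```

For this matrix the classes come out as {1, 2} (closed), {3} (not closed) and {4} (closed), matching the decomposition by hand.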
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*All states communicate, so if one of them is recurrent, then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. The chain can be back at <math>\displaystyle 0</math> only after an even number of steps <math>\displaystyle 2n</math>, of which exactly <math>\displaystyle n</math> go to the right and <math>\displaystyle n</math> go to the left.<br />
<math>\displaystyle Pr(x_{2n}=0\mid x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability could therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the formula becomes:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
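We can conclude that if <math>\displaystyle 4pq < 1</math> (i.e. <math>p\neq \frac{1}{2}</math>) then the series converges and the state is transient; otherwise <math>\displaystyle 4pq = 1</math>, the series diverges and the state is recurrent.<br />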
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} = \begin{cases}<br />
\infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
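The two regimes can also be seen numerically. The Python sketch below accumulates partial sums of <math>P_{00}(2n)={2n \choose n}p^nq^n</math>; the value <math>p=0.3</math> is a hypothetical choice for the asymmetric case, and the function name is our own.

```python
import math


def return_series(p, terms):
    """Partial sum of C(2n, n) (pq)^n for n = 0, ..., terms - 1."""
    x = p * (1.0 - p)
    term = 1.0  # n = 0 term: C(0, 0) x^0
    total = 0.0
    for n in range(terms):
        total += term
        # Ratio of consecutive terms: C(2n+2, n+1) / C(2n, n) = 2(2n+1)/(n+1).
        term *= 2.0 * (2 * n + 1) / (n + 1) * x
    return total


# Asymmetric walk: the partial sums settle at 1 / sqrt(1 - 4pq).
p = 0.3
closed_form = 1.0 / math.sqrt(1.0 - 4.0 * p * (1.0 - p))
print(return_series(p, 500), closed_form)

# Symmetric walk: the partial sums keep growing (roughly like sqrt(N)).
for terms in (100, 1000, 10000):
    print(terms, return_series(0.5, terms))
```

For <math>p=0.3</math> the sum agrees with the closed form, which is finite, so the states are transient. For <math>p=0.5</math> the partial sums grow without bound as more terms are added, which is the recurrent case.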
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from the state i to the state j. It is also possible that the state j is not reachable from the state i, in which case the time is infinite.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n\geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists (starting from } x_{0}=i) \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j\mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
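A state has period <math>\displaystyle d</math> if, starting from that state, a return to it is possible only after a multiple of <math>\displaystyle d</math> steps, with <math>\displaystyle d</math> the largest such integer (that is, <math>\displaystyle d=\gcd\{n: P_{ii}(n)>0\}</math>). A Markov chain is called <math>\emph{periodic}</math> if its states have period <math>\displaystyle d>1</math>.<br />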
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from the state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition. You can only get back to state 1 after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
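The periods in both examples can be checked by computing the gcd of the step counts at which a return has positive probability. The Python sketch below does this up to a fixed horizon; the horizon `max_steps` and the function names are our own assumptions, and a finite horizon only approximates the definition in general.

```python
from math import gcd


def mat_mul(A, B):
    """Return the matrix product A * B for square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def period(P, state, max_steps=50):
    """gcd of all n <= max_steps with P_ii(n) > 0 (0 if no return is seen)."""
    d = 0
    power = P  # holds P^n at the top of iteration n
    for n in range(1, max_steps + 1):
        if power[state][state] > 0:
            d = gcd(d, n)
        power = mat_mul(power, P)
    return d


three_cycle = [[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]]
four_state = [[0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0],
              [0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0]]

print(period(three_cycle, 0))  # 3
print(period(four_state, 0))   # 2
```

A chain in which some state can return to itself in one step, such as a chain with a positive diagonal entry, has period 1 for that state and is aperiodic there.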
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br>
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
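To see concretely why this chain has a stationary distribution but no limiting distribution, the Python sketch below checks <math>\pi P=\pi</math> and then lists the first few powers of <math>P</math>. The helper name is our own.

```python
def mat_mul(A, B):
    """Return the matrix product A * B for square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
pi = [1 / 3, 1 / 3, 1 / 3]

# pi is stationary: pi P equals pi.
pi_P = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
print(pi_P)

# powers[k] holds P^(k+1); the sequence cycles instead of settling down.
powers = [P]
for _ in range(5):
    powers.append(mat_mul(powers[-1], P))

for n, Pn in enumerate(powers, start=1):
    print(n, Pn)
```

The powers repeat with period 3 (<math>P^3</math> is the identity and <math>P^4=P</math>), so <math>P^n</math> never approaches a matrix with identical rows.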
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5\ \ 0\ \ 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
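The same answer can be obtained numerically by starting from any distribution and applying <math>\mu_n=\mu_{n-1}P</math> many times. The Python sketch below does this; the starting vector and the number of iterations (200) are arbitrary choices of ours.

```python
P = [[1/2, 1/2, 0],
     [1/2, 1/4, 1/4],
     [0, 1/3, 2/3]]


def step(mu, P):
    """One step of the chain on distributions: return mu * P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


mu = [1.0, 0.0, 0.0]  # start in the first state with probability 1
for _ in range(200):
    mu = step(mu, P)

exact = [4/11, 4/11, 3/11]
print(mu)
print(exact)
```

The iterated vector should agree with <math>(4/11, 4/11, 3/11)</math> to many decimal places, and starting from a different initial distribution should give the same limit.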
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>, this is the starting point of the chain, choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y \sim q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
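The six steps above can be sketched in Python as follows. This is only a minimal illustration under our own assumptions: the function and argument names are hypothetical, the target is an unnormalized standard normal rather than anything from the lecture, and the proposal is <math>N(x,b^2)</math> with <math>b=2</math>.

```python
import math
import random


def metropolis_hastings(f, propose, q_density, x0, n_samples, rng):
    """Return the chain X_0, ..., X_{n_samples - 1}.

    f         -- target density, known up to a constant
    propose   -- propose(x, rng) draws Y ~ q(. | x)
    q_density -- q_density(y, x) evaluates q(y | x)
    """
    chain = [x0]                                   # step 1
    for _ in range(n_samples - 1):
        x = chain[-1]
        y = propose(x, rng)                        # step 2
        ratio = (f(y) / f(x)) * (q_density(x, y) / q_density(y, x))
        r = min(ratio, 1.0)                        # step 3
        u = rng.random()                           # step 4
        chain.append(y if u < r else x)            # steps 5 and 6
    return chain


# Illustrative choices (assumptions, not from the lecture).
b = 2.0
target = lambda x: math.exp(-0.5 * x * x)
propose = lambda x, rng: rng.gauss(x, b)
q_density = lambda y, x: (math.exp(-0.5 * ((y - x) / b) ** 2)
                          / (b * math.sqrt(2 * math.pi)))

rng = random.Random(0)
chain = metropolis_hastings(target, propose, q_density, 0.0, 20000, rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, var)
```

Since the target here is a standard normal, the sample mean and variance of the chain should come out near 0 and 1. Note that the target only enters through the ratio <math>f(y)/f(x)</math>, so its normalizing constant is never needed.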
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal distribution centered at the current state <math>x</math>. Only the starting point <math>X_0</math> is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of Metropolis-Hastings.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The probability of setting the mean to x and sampling y is equal to the probability of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the fraction <math>\frac{f(y)}{f(x)}</math> would be very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution.<br />
<br />
::It is easy to observe by looking at the histogram of <math>\displaystyle X</math>, the shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
:: Most likely we reject y and the chain will get stuck.<br />
<br />
::For the above example, the following output occurs when choosing B too large (B=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing B too small (B=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::Well chosen b will help avoid the issues mentioned above and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for B (B=2)<br />
<br />
[[File:Bgood.JPG]]<br />
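One rough way to quantify the three cases is to record how often proposals are accepted. The Python sketch below reruns the Cauchy example for the three values of b used above and reports the acceptance rate; it is a re-implementation under our own assumptions (function name, seed, and chain length are ours), not the code that produced the figures.

```python
import random


def cauchy_acceptance_rate(b, n_steps=10000, seed=0):
    """Fraction of accepted proposals for the Cauchy target with N(x, b^2) proposal."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)          # random starting point
    accepted = 0
    for _ in range(n_steps):
        y = rng.gauss(x, b)          # propose Y ~ N(x, b^2)
        r = min((1 + x * x) / (1 + y * y), 1.0)
        if rng.random() < r:
            x = y                    # accept Y
            accepted += 1
    return accepted / n_steps


for b in (0.001, 2, 1000):
    print(b, cauchy_acceptance_rate(b))
```

With b = 1000 almost every proposal is rejected, so the chain sits still. With b = 0.001 almost every proposal is accepted, but each move is tiny, so the chain explores very slowly. An intermediate value such as b = 2 accepts a moderate fraction of reasonably large moves.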
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We talked about <math>\emph{discrete}</math> MC so far. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
That is to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math><br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
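The same argument can be checked numerically in the discrete case. The Python sketch below builds the Metropolis-Hastings transition matrix <math>P_{ij}=q_{ij}\,r(i,j)</math> for <math>i\neq j</math> and verifies detailed balance and stationarity. The 3-state target and the asymmetric proposal matrix are hypothetical values chosen only for illustration.

```python
# Hypothetical target pmf and asymmetric proposal matrix (illustration only).
target = [0.2, 0.5, 0.3]
Q = [[0.0, 0.7, 0.3],
     [0.4, 0.0, 0.6],
     [0.5, 0.5, 0.0]]

n = len(target)
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            # Acceptance probability r(i, j) from the algorithm.
            r = min(target[j] * Q[j][i] / (target[i] * Q[i][j]), 1.0)
            P[i][j] = Q[i][j] * r
    # Rejected proposals keep the chain where it is.
    P[i][i] = 1.0 - sum(P[i])

# Largest violation of detailed balance: target_i P_ij - target_j P_ji.
balance_gap = max(abs(target[i] * P[i][j] - target[j] * P[j][i])
                  for i in range(n) for j in range(n))

# Stationarity: target * P should reproduce target.
target_P = [sum(target[i] * P[i][j] for i in range(n)) for j in range(n)]

print(balance_gap)
print(target_P)
```

The detailed balance gap should be zero up to rounding, and multiplying the target by <math>P</math> should return the target, even though the proposal matrix is not symmetric.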
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to help sample from probability distributions that are difficult to sample from. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is more efficient, although it is not always applicable.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math> including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. Accept <math>\displaystyle Y</math> with probability <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In MATLAB, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math> (i.e. usually used when <math>\displaystyle f </math> is known up to a proportionality constant - the form of the distribution is known, but the distribution is not exactly known)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
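The dependence is easy to see in a simulation, because every rejection repeats the current value. The Python sketch below runs an independence sampler under our own assumptions: the target is an unnormalized <math>N(0,1)</math>, the fixed proposal is <math>N(0,2^2)</math>, and the function name, seed, and chain length are hypothetical.

```python
import math
import random


def independence_mh(n_samples, seed=0):
    """Independence Metropolis-Hastings with a fixed N(0, 2^2) proposal."""
    rng = random.Random(seed)
    f = lambda x: math.exp(-0.5 * x * x)            # target, unnormalized
    q = lambda x: math.exp(-0.5 * (x / 2.0) ** 2)   # proposal, unnormalized
    x = 0.0
    chain = [x]
    for _ in range(n_samples - 1):
        y = rng.gauss(0.0, 2.0)                     # Y does not depend on x
        r = min((f(y) / f(x)) * (q(x) / q(y)), 1.0) # but r does
        if rng.random() < r:
            x = y
        chain.append(x)
    return chain


chain = independence_mh(20000)
repeats = sum(1 for i in range(1, len(chain)) if chain[i] == chain[i - 1])
mean = sum(chain) / len(chain)
print(repeats, mean)
```

A truly independent sequence of continuous draws would essentially never repeat a value, whereas here a sizeable fraction of consecutive entries are identical. The normalizing constants of both f and q cancel in r, which is why unnormalized densities are enough.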
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But, this is the same problem as <math>\displaystyle \max(e^{\frac{-h(x)}{T}})</math> for some constant T (since the exponential function is a monotone function)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, set <math>\displaystyle i = i+1</math>, and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
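<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math>, although this probability is close to 1 when T is large compared to <math>\displaystyle h(y)-h(x) </math><br />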
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore almost always reject <math>\displaystyle y </math><br />
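<br />
The effect of the temperature on the acceptance probability can be checked numerically. The following is a minimal Python sketch, not part of the lecture; the function name <code>acceptance_probability</code> and the values of h(x), h(y) and T are illustrative assumptions.<br />

```python
import math

def acceptance_probability(h_x, h_y, T):
    """Return r = min(1, exp((h(x) - h(y)) / T)) for a symmetric proposal."""
    if h_y <= h_x:
        return 1.0                      # downhill (or equal) moves are always accepted
    return math.exp((h_x - h_y) / T)    # uphill moves are accepted with probability < 1

# An uphill move from h(x) = 1 to h(y) = 2 at a large, moderate and small temperature.
for T in (100.0, 1.0, 0.01):
    print(T, acceptance_probability(1.0, 2.0, T))
```

For the large temperature the uphill move is accepted almost surely, while for the small temperature it is essentially never accepted.<br />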
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it makes a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{-\frac{f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for large T and contracts for small T, i.e. T plays the role of variance, making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, when T has become quite small, the distribution that we are sampling from is sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
% Save the objective in its own file obj.m (or place it at the end of the script).<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
T = 10; % initial value of T, which we must gradually decrease<br />
b = 2; % standard deviation of the normal proposal<br />
X(1) = 0;<br />
for i = 2:100 % as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); % propose Y ~ N(X(i-1), b^2)<br />
r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) ); % f(Y)/f(X(i-1)) for a symmetric proposal<br />
U = rand;<br />
if U < r<br />
X(i) = Y; % accept Y<br />
else<br />
X(i) = X(i-1); % reject Y<br />
end<br />
end<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
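<br />
The whole cooling schedule can also be run in one loop. The sketch below is a Python re-expression of the example under stated assumptions: 100 steps per temperature, a fixed seed, and the helper name <code>simulated_annealing</code> are choices made here for illustration, not details from the lecture.<br />

```python
import math
import random

def obj(x):
    """Objective to minimize."""
    return (x - 2.0) ** 2

def simulated_annealing(temperatures, steps_per_temperature=100, b=2.0, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    path = [x]
    for T in temperatures:                      # temperatures are visited in the given order
        for _ in range(steps_per_temperature):
            y = rng.gauss(x, b)                 # symmetric proposal Y ~ N(x, b^2)
            delta = obj(x) - obj(y)
            r = 1.0 if delta >= 0 else math.exp(delta / T)
            if rng.random() < r:
                x = y                           # accept Y
            path.append(x)                      # on rejection the old x is repeated
    return path

schedule = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
path = simulated_annealing(schedule)
print(path[-1])
```

The final value printed should lie close to 2, the minimizer of the objective.<br />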
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
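<br />
To make the cost function concrete, the following Python sketch computes the length of a closed circuit through a set of cities, which is the quantity simulated annealing would minimize. The city coordinates and the name <code>tour_length</code> are hypothetical and chosen only for illustration.<br />

```python
import math

def tour_length(cities, order):
    """Length of the closed circuit that visits the cities in the given order."""
    total = 0.0
    for k in range(len(order)):
        a = cities[order[k]]
        b = cities[order[(k + 1) % len(order)]]   # wrap around to return to the start
        total += math.hypot(a[0] - b[0], a[1] - b[1])
    return total

# Four hypothetical cities on the corners of a unit square.
cities = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(tour_length(cities, [0, 1, 2, 3]))   # walks around the perimeter
print(tour_length(cities, [0, 2, 1, 3]))   # crosses the diagonals, so it is longer
```

A proposal step for this problem could, for example, swap two cities in <code>order</code> and accept or reject the new tour using the same rule as above.<br />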
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds, including mathematics, CS, chemistry, physics, and psychology. Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience, to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, as Metropolis-Hastings does:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the proof given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively large correlations between the random variables, because the chain then moves through the distribution very slowly.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and, once the chain has converged, the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> approximately follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution is sampled better by using only some of the last <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle x_2|x_1 \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle x_1|x_2 \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; % mean vector (mu_1, mu_2)<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2); % conditional standard deviation<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); % mean of x_1 given the previous x_2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); % mean of x_2 given the NEW x_1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
% plot(X(:,1),X(:,2),'.') plots all of the points<br />
% plot(X(1000:end,1),X(1000:end,2),'.') plots roughly the last 4000 points -><br />
% this demonstrates that convergence occurs after a while<br />
% (this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
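<br />
The same sampler can be sketched in Python using only the standard library. This is an illustrative version, not the lecture's code: the function name <code>gibbs_bivariate_normal</code>, the value <math>\rho = 0.5</math>, the seed and the burn-in length of 1000 are assumptions made here.<br />

```python
import math
import random

def gibbs_bivariate_normal(mu1, mu2, rho, n_samples, seed=0):
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)                    # conditional standard deviation
    x1, x2 = 0.0, 0.0                                 # starting point (0, 0)
    samples = []
    for _ in range(n_samples):
        x1 = rng.gauss(mu1 + rho * (x2 - mu2), sd)    # draw x1 given the current x2
        x2 = rng.gauss(mu2 + rho * (x1 - mu1), sd)    # draw x2 given the new x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(1.0, 2.0, rho=0.5, n_samples=20000)
kept = samples[1000:]                                 # discard the burn-in draws
mean1 = sum(s[0] for s in kept) / len(kept)
mean2 = sum(s[1] for s in kept) / len(kept)
print(mean1, mean2)
```

The two sample means should be close to the target means 1 and 2.<br />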
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
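<br />
The procedure above can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the target <code>log_f</code> is an unnormalised bivariate normal with mean (1, 2) and correlation 0.5 chosen here as a test case, both proposals are symmetric normal random walks (so the proposal ratio cancels), and all names are hypothetical.<br />

```python
import math
import random

def log_f(x, y, rho=0.5):
    """Unnormalised log-density of a bivariate normal with mean (1, 2)."""
    dx, dy = x - 1.0, y - 2.0
    return -(dx * dx - 2.0 * rho * dx * dy + dy * dy) / (2.0 * (1.0 - rho * rho))

def mh_within_gibbs(n_iter, step=1.0, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    chain = []
    for _ in range(n_iter):
        # Metropolis-Hastings step for X, treating Y as fixed.
        z = rng.gauss(x, step)
        if math.log(1.0 - rng.random()) < log_f(z, y) - log_f(x, y):
            x = z
        # Metropolis-Hastings step for Y, treating the new X as fixed.
        z = rng.gauss(y, step)
        if math.log(1.0 - rng.random()) < log_f(x, z) - log_f(x, y):
            y = z
        chain.append((x, y))
    return chain

chain = mh_within_gibbs(30000)[2000:]          # drop the first 2000 draws as burn-in
mean_x = sum(p[0] for p in chain) / len(chain)
mean_y = sum(p[1] for p in chain) / len(chain)
print(mean_x, mean_y)
```

Working with log-densities avoids numerical underflow; comparing <code>log(U)</code> with the log-ratio is equivalent to accepting with probability <math>\min\{1, r\}</math>.<br />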
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many important pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we may fix its scale. Summing the rank formula over all pages shows that, when every page has at least one outgoing link and <math>\displaystyle d < 1</math>, the ranks satisfy <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
so the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). We can then solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
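<br />
The iteration can be sketched in Python as follows. This is a hedged illustration rather than the lecture's method: it iterates the rank formula for <math>\displaystyle P_i</math> directly, it assumes every page has at least one outgoing link, and the three-page link matrix, the value d = 0.85 and the name <code>page_rank</code> are hypothetical.<br />

```python
def page_rank(L, d=0.85, n_iter=200):
    """Iterate P_i = (1 - d) + d * sum_j L[i][j] * P_j / c_j until it settles."""
    N = len(L)
    # c[j] = number of outgoing links from page j (column sum of L); assumed > 0.
    c = [sum(L[i][j] for i in range(N)) for j in range(N)]
    P = [1.0] * N                                   # initial guess
    for _ in range(n_iter):
        P = [(1.0 - d) + d * sum(L[i][j] * P[j] / c[j] for j in range(N))
             for i in range(N)]
    return P

# Hypothetical web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# L[i][j] = 1 if page j links to page i.
L = [[0, 0, 1],
     [1, 0, 0],
     [1, 1, 0]]
P = page_rank(L)
print(P)
```

In this small example page 2 receives links from both other pages and ends up with the highest rank, while page 1, which is linked only from a page with two outgoing links, ends up with the lowest.<br />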
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
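<br />
These definitions translate directly into code. The following Python sketch is illustrative only; the helper names <code>dot</code>, <code>norm</code>, <code>dist</code> and <code>angle</code> and the example vectors are assumptions made here.<br />

```python
import math

def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Length of v, sqrt(v . v)."""
    return math.sqrt(dot(v, v))

def dist(u, v):
    """Distance between u and v, the length of u - v."""
    return norm([a - b for a, b in zip(u, v)])

def angle(u, v):
    """Angle between two non-zero vectors, from u . v = |u| |v| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))   # clamp against rounding error

u, v = [3.0, 4.0], [4.0, -3.0]
print(dot(u, v), norm(u), dist(u, v), angle(u, v))
```

Here the inner product is 0, so the two vectors are orthogonal and the angle is <math>\pi/2</math>.<br />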
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be computed as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or, equivalently, columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>,then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent, it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math>, which satisfies <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
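<br />
As a small worked illustration (an example chosen here, not taken from the lecture), consider the real symmetric matrix <math>A = \left[\begin{matrix} 2 & 1 \\ 1 & 2 \end{matrix}\right]</math>. Its characteristic equation is<br />
<br />
<math>|A - \lambda I| = (2-\lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3) = 0 \Rightarrow \lambda_1 = 1, \ \lambda_2 = 3</math><br />
<br />
Solving <math>(A - \lambda I)v = 0</math> gives the eigenvectors <math>v_1 = \left[\begin{matrix} 1 \\ -1 \end{matrix}\right]</math> and <math>v_2 = \left[\begin{matrix} 1 \\ 1 \end{matrix}\right]</math>, which are orthogonal since <math>v_1^T v_2 = 0</math>. Both eigenvalues are real, non-zero and positive, consistent with A being symmetric, non-singular and positive-definite, and their sum equals <math>tr(A) = 4</math>.<br />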
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y in <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
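<br />
The magnitude-preserving property can be checked on a simple case. The Python sketch below uses a 2-D rotation matrix as an example of an orthonormal transformation; the function name <code>rotate</code>, the angle and the vector are illustrative assumptions.<br />

```python
import math

def rotate(theta, x):
    """Apply the 2-D rotation matrix [[cos, -sin], [sin, cos]] to the vector x."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]

x = [3.0, 4.0]
y = rotate(math.pi / 6, x)                    # y = A x with A orthonormal
print(math.hypot(x[0], x[1]), math.hypot(y[0], y[1]))
```

Both printed lengths agree up to rounding error, as expected from <math>|y| = |x|</math>.<br />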
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second Principal Component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>c > 0</math>, so without loss of generality we take <math>\textbf{w}</math> to have unit length,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the sample variance of the projection <math>\displaystyle u </math> of the data onto the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes this variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
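<br />
Before working through the derivation, the maximizer can be approximated numerically. The Python sketch below swaps in power iteration on the sample covariance matrix (a different technique from the Lagrange multiplier argument) for a small two-dimensional data set; the data values and the name <code>first_principal_component</code> are hypothetical.<br />

```python
import math

def first_principal_component(data, n_iter=200):
    """Power iteration on the 2x2 sample covariance matrix s of 2-D data."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    w = [1.0, 0.0]                                       # initial direction
    for _ in range(n_iter):
        w = [sxx * w[0] + sxy * w[1], sxy * w[0] + syy * w[1]]   # w <- s w
        length = math.hypot(w[0], w[1])
        w = [w[0] / length, w[1] / length]               # enforce w^T w = 1
    variance = sxx * w[0] ** 2 + 2.0 * sxy * w[0] * w[1] + syy * w[1] ** 2   # w^T s w
    return w, variance

# Hypothetical points scattered closely around the line y = x.
data = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
w, variance = first_principal_component(data)
print(w, variance)
```

The returned direction is close to the diagonal, and the variance along it is larger than the variance of either coordinate on its own.<br />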
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to understand which one is the maximum, we just need to substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
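The result of this example can be checked numerically. The following is a minimal sketch in Python (not part of the lecture): it assumes we parametrize the constraint circle as <math>(\cos t, \sin t)</math> and scan a fine grid of angles; the names <code>N</code>, <code>best_point</code> and <code>lagrange_point</code> are illustrative only.<br />

```python
import math

def f(x, y):
    """Objective function from the example."""
    return x - y

# Parametrize the constraint x^2 + y^2 = 1 as (cos t, sin t) and scan t on a grid.
N = 100000
angles = (2 * math.pi * k / N for k in range(N))
best_t = max(angles, key=lambda t: f(math.cos(t), math.sin(t)))
best_point = (math.cos(best_t), math.sin(best_t))

# Stationary point with the largest value of f, as found by Lagrange multipliers.
lagrange_point = (math.sqrt(2) / 2, -math.sqrt(2) / 2)

# At that point the gradients are parallel: grad f = lam * grad g.
grad_f = (1.0, -1.0)
grad_g = (2 * lagrange_point[0], 2 * lagrange_point[1])
lam = grad_f[0] / grad_g[0]

print(best_point, lagrange_point, lam)
```

The grid search and the Lagrange solution should agree up to the grid resolution, and the same ratio <code>lam</code> should relate both components of the two gradients.<br />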
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the constraint is violated and the second term is nonzero. The constrained maximum is therefore sought among vectors with <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
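The relations above can be illustrated on a small two-dimensional data set. The sketch below is in Python and is not the code used in class; the data points and the helper names (<code>sample_covariance</code>, <code>eig_sym_2x2</code>, <code>quad_form</code>) are assumptions made for illustration. It uses the closed-form eigendecomposition of a symmetric 2x2 matrix, so it does not generalize to higher dimensions as written.<br />

```python
import math

def sample_covariance(points):
    """Sample covariance matrix (2x2) of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def eig_sym_2x2(S):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    mid = (a + d) / 2.0
    rad = math.hypot((a - d) / 2.0, b)
    lams = [mid + rad, mid - rad]
    if abs(b) < 1e-15:
        # Already diagonal: the eigenvectors are the coordinate axes.
        vecs = [(1.0, 0.0), (0.0, 1.0)] if a >= d else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vecs = []
        for lam in lams:
            vx, vy = b, lam - a          # solves (S - lam*I) v = 0
            norm = math.hypot(vx, vy)
            vecs.append((vx / norm, vy / norm))
    return lams, vecs

def quad_form(w, S):
    """Variance of the projection u = w^T x, i.e. w^T S w."""
    return (w[0] * (S[0][0] * w[0] + S[0][1] * w[1])
            + w[1] * (S[1][0] * w[0] + S[1][1] * w[1]))

# Hypothetical data, chosen only for illustration.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
S = sample_covariance(points)
(lam1, lam2), (w1, w2) = eig_sym_2x2(S)

print("variance along first PC :", quad_form(w1, S), "eigenvalue:", lam1)
print("sum of eigenvalues      :", lam1 + lam2, "trace of S:", S[0][0] + S[1][1])
```

The variance along the first principal direction should equal the largest eigenvalue, and the two eigenvalues should add up to the trace of the sample covariance matrix.<br />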
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
The matlab code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs , scores ] = princomp (X') <br />
% performs principal components analysis on the data matrix X<br />
% returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
[[File:Plot1.jpg]]<br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
[[File:Plot2.jpg]]<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
% the 'ro' command makes them red Os on the plot<br />
[[File:Plot3.jpg]]<br />
 % If we classify based on the position in this plot (feature), <br />
 % it's easier than looking at each of the 64 data pieces<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
[[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
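The encoding and reconstruction steps in the notes above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the algorithm from the lecture slide: it keeps only d = 1 direction, finds it by power iteration (a swapped-in technique that assumes the starting vector is not orthogonal to the leading eigenvector), stores samples as rows rather than columns, and uses made-up data lying close to a line in three dimensions. All function names are hypothetical.<br />

```python
import math

def center(X):
    """Subtract the per-feature mean from every sample (rows are samples)."""
    n, D = len(X), len(X[0])
    mean = [sum(x[j] for x in X) / n for j in range(D)]
    return [[x[j] - mean[j] for j in range(D)] for x in X], mean

def top_direction(Xc, iters=500):
    """Leading eigenvector of the sample covariance matrix, by power iteration."""
    n, D = len(Xc), len(Xc[0])
    S = [[sum(x[i] * x[j] for x in Xc) / (n - 1) for j in range(D)]
         for i in range(D)]
    w = [1.0] * D
    for _ in range(iters):
        w = [sum(S[i][j] * w[j] for j in range(D)) for i in range(D)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

def encode(Xc, u):
    """Encoding: y = u^T x for every centered sample x."""
    return [sum(ui * xi for ui, xi in zip(u, x)) for x in Xc]

def reconstruct(Y, u, mean):
    """Reconstruction: x_hat = u * y, with the mean added back."""
    return [[y * ui + m for ui, m in zip(u, mean)] for y in Y]

# Hypothetical data lying close to the line t * (1, 2, 2).
X = [[1.0, 2.1, 1.9], [2.0, 3.9, 4.1], [3.0, 6.0, 6.1],
     [4.0, 8.1, 7.9], [5.0, 9.9, 10.0]]
Xc, mean = center(X)
u = top_direction(Xc)
Y = encode(Xc, u)
X_hat = reconstruct(Y, u, mean)

for x, xh in zip(X, X_hat):
    print(x, "->", [round(v, 3) for v in xh])
```

Because the data are nearly one-dimensional, a single component should reproduce each sample closely; the small residual corresponds to the low-variance directions that were dropped.<br />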
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower dimensional space. The difference is that we are not interested in maximizing variances. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to, depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapses to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2</math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within classes covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar. Therefore,<br />
<br /><br /><br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two class problem the between class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a well-known problem called the "generalized eigenvalue problem". We can solve it using Lagrange multipliers. Since <math>w</math> only specifies a direction, we do not care about its length. Therefore we solve a problem similar to that in PCA,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br /><br />
subject to <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br /><br /><br />
We solve the following Lagrange Multiplier problem,<br />
<br /><br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br /><br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, our Optimization problem for FDA is:<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />Subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
Using Lagrange multipliers, we form the Lagrangian: <br />
<math>\displaystyle L(w,\lambda) = (w^Ts_Bw) - \lambda [(w^Ts_ww)-1] </math><br />
<br />
Setting the derivative with respect to <math>w</math> to zero gives <math>\displaystyle s_Bw = \lambda s_ww</math>, i.e. <math>\displaystyle s_w^{-1}s_Bw = \lambda w</math> (assuming <math>s_w</math> is invertible). Hence:<br />
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified, such that:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
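The two-class solution can be checked numerically. The following Python sketch is an illustration only: the class means and the common covariance matrix are assumed values, the helper names are hypothetical, and the check simply compares the ratio <math>w^Ts_Bw / w^Ts_ww</math> for the closed-form direction against a grid of other directions.<br />

```python
import math

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def fisher_ratio(w, s_b, s_w):
    """Between-class over within-class variance of the projection onto w."""
    return dot(w, matvec(s_b, w)) / dot(w, matvec(s_w, w))

# Assumed class means and common class covariance (illustrative values).
mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]

s_w = [[2 * v for v in row] for row in sigma]       # Sigma_1 + Sigma_2
diff = [mu2[0] - mu1[0], mu2[1] - mu1[1]]
s_b = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

w = matvec(inv2(s_w), diff)                         # w proportional to s_w^{-1}(mu2 - mu1)
best = fisher_ratio(w, s_b, s_w)
print("w =", w, "ratio =", best)
```

Scaling <code>w</code> by any nonzero constant leaves the ratio unchanged, which is why only the direction matters.<br />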
<br />
=====Example:=====<br />
<br />
We now see an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data set:<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300) <br />
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300) <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
 X=[X1;X2]   % stack the two samples row-wise (600 x 2); [X1,X2] would give a 300 x 4 matrix<br />
X=X'<br />
[coefs, scores]=princomp(X');<br />
coefs(:,1) %first principal component<br />
coefs(:,1)<br />
<br />
ans =<br />
0.76355476446932<br />
0.64574307712603<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=inv(sw)*[4 2]' % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is no overlap<br />
Xp=coefs(:,1)'*X<br />
figure<br />
 plot(Xp(1:300),1,'ob')   % a marker is needed for the individual points to be drawn<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance we could be wishing to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (When Y is instead a continuous variable, the prediction problem is one of regression rather than classification.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we estimate the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points. This gives the empirical error rate,<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
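As a minimal sketch of this count in Python (the threshold classifier and the four test points are made up for illustration):<br />

```python
def empirical_error(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) differs from y_i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

def threshold_classifier(x):
    """Hypothetical classifier: predict class 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0

test_x = [0.1, 0.4, 0.6, 0.9]
test_y = [0, 1, 1, 1]
error = empirical_error(threshold_classifier, test_x, test_y)
print(error)  # one of the four points (x = 0.4) is misclassified
```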
<br />
<br />
=== Bayes Classification Rule ===<br />
<br />
Considering the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math><br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classifier function is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & o/w\\<br />
\end{cases}</math><br />
<br />
This function is considered the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions to calculate <math>\, r(x)</math>. This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.<br />
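When the class-conditional densities and the priors are known, the rule can be evaluated directly. The Python sketch below assumes, purely for illustration, one-dimensional Gaussian class-conditional densities with known means and a common standard deviation; in practice these quantities are unknown, which is the difficulty noted above. The function names are hypothetical.<br />

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def posterior_class1(x, prior1, mu0, mu1, sigma):
    """r(x) = Pr(Y=1 | X=x), computed with Bayes' formula."""
    p1 = normal_pdf(x, mu1, sigma) * prior1
    p0 = normal_pdf(x, mu0, sigma) * (1.0 - prior1)
    return p1 / (p1 + p0)

def bayes_classifier(x, prior1, mu0, mu1, sigma):
    """h(x) = 1 if r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior_class1(x, prior1, mu0, mu1, sigma) >= 0.5 else 0

# Assumed setting: equal priors, class 0 centred at 0, class 1 centred at 2.
params = dict(prior1=0.5, mu0=0.0, mu1=2.0, sigma=1.0)
for x in (-1.0, 1.0, 3.0):
    print(x, posterior_class1(x, **params), bayes_classifier(x, **params))
```

With equal priors and equal variances, the posterior equals one half at the midpoint between the two means, so that point lies on the decision boundary.<br />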
<br />
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, the boundary consists of those points at which the probabilities of belonging to the two classes are identical. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a linear decision boundary while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
We would like to apply the Bayes classification rule by approximating the class conditional density <math>f_k(x)</math> and the prior <math>\pi_k</math> where,<br /><br /><br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l}f_l(x)\pi_l}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional density is multivariate gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>(x + a)^TA(x+b) = x^TAx + x^TAb + a^TAx + a^TAb</math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\Sigma_k</math> is the class covariance matrix and <math>\mu_k</math> is the class mean. By definition of the decision boundary,<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is a linear function of <math>x</math> of the form <math>a^Tx + b = 0</math>.<br />
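A short numerical check of this linear form, written as a Python sketch under assumed values (two-dimensional data, a made-up common covariance matrix, and unequal priors; all names are hypothetical). It takes <math>a = \Sigma^{-1}(\mu_k - \mu_l)</math> and <math>b = \frac{1}{2}(\mu_l^T\Sigma^{-1}\mu_l - \mu_k^T\Sigma^{-1}\mu_k) + \log\frac{\pi_k}{\pi_l}</math>, and compares <math>a^Tx + b</math> with the difference of the two log posterior scores.<br />

```python
import math

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_boundary(mu_k, mu_l, sigma, pi_k, pi_l):
    """Coefficients (a, b) of the boundary a^T x + b = 0."""
    sigma_inv = inv2(sigma)
    a = matvec(sigma_inv, [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]])
    b = (0.5 * (dot(mu_l, matvec(sigma_inv, mu_l))
                - dot(mu_k, matvec(sigma_inv, mu_k)))
         + math.log(pi_k / pi_l))
    return a, b

def log_score(x, mu, sigma, prior):
    """log(prior) - (1/2)(x - mu)^T Sigma^{-1} (x - mu); shared constants dropped."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    return math.log(prior) - 0.5 * dot(d, matvec(inv2(sigma), d))

# Assumed parameters, for illustration only.
mu_k, mu_l = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]
pi_k, pi_l = 0.3, 0.7

a, b = lda_boundary(mu_k, mu_l, sigma, pi_k, pi_l)
for x in ([0.0, 0.0], [3.0, 2.0], [6.0, -1.0]):
    linear = dot(a, x) + b
    difference = log_score(x, mu_k, sigma, pi_k) - log_score(x, mu_l, sigma, pi_l)
    print(x, linear, difference)
```

The two printed columns should agree, so a point is assigned to class k exactly when <math>a^Tx + b > 0</math>.<br />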
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|) </math><br />
<br />
<br />
<br />
To classify a point, x, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class k if <math>\, \delta_k > \delta_l </math> and vice versa. <br /><br /><br />
<br />
:<math><br />
h(x) = \begin{cases}<br />
k, & \text{if } \delta_k > \delta_l \\<br />
l, & otherwise\\<br />
\end{cases}</math><br />
<br />
(note: since <math> - \frac{d}{2}log(2\pi) </math> is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.) <br /><br /><br />
<br />
Consider a '''special case''': <math>\, \Sigma_k = I </math>, the identity matrix. Then,<br />
<br />
<math> \,\delta_k = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) - \frac{1}{2}log(|I|) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) </math><br />
<br />
We see that in the case <math> \Sigma = I </math>, we can simply classify a point, x, to a class based on the distances between x and the mean of the different classes (adjusted with the log of the prior). <br /><br /><br />
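For this special case the rule reduces to a few lines. The Python sketch below is illustrative only; the class means, priors, and function names are assumptions.<br />

```python
import math

def delta_identity(x, mu, prior):
    """delta_k = log(pi_k) - (1/2) * ||x - mu_k||^2, valid when Sigma = I."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.log(prior) - 0.5 * sq_dist

def classify(x, means, priors):
    """Return the class label with the largest delta."""
    return max(means, key=lambda k: delta_identity(x, means[k], priors[k]))

# Assumed class means; the point x is slightly closer to the mean of class 0.
means = {0: [0.0, 0.0], 1: [2.0, 0.0]}
x = [0.9, 0.0]

label_equal = classify(x, means, {0: 0.5, 1: 0.5})
label_skewed = classify(x, means, {0: 0.1, 1: 0.9})
print(label_equal, label_skewed)
```

With equal priors the point goes to the nearer mean; a strong prior in favour of the other class can change the decision, which is the adjustment by the log of the prior mentioned above.<br />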
<br />
In the case that <math>\, \Sigma_k \ne I </math>, we do, <br /><br /><br />
<br />
<math> \, \Sigma_k = USV^T = USU^T </math> (since <math> \Sigma </math> is symmetric)<br /><br /><br />
<math> \, \Sigma_k^{-1} = (USU^T)^{-1} = (U^T)^{-1}S^{-1}U^{-1} = US^{-1}U^T </math> (since <math>U</math> is orthogonal)<br /><br /><br />
<br />
So, <math> (x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^T US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-1}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (U^Tx-U^T\mu_k)^T S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^Tx-U^T\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k)^T I(S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k) </math><br />
:<math> \, = (x^* - \mu_k^*)^TI(x^* - \mu_k^*) </math> <br /><br /><br />
<br />
where <math> \, x^* = S^{-\frac{1}{2}}U^Tx </math> and <math> \mu_k^* = S^{-\frac{1}{2}}U^T\mu_k </math></div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3441stat341 / CM 3612009-07-24T02:54:09Z<p>Hclam: /* Linear Discriminant Analysis(LDA) - July 23 */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges, since a computer follows a deterministic procedure. In practice, we use algorithms that produce sequences of numbers whose distribution appears to be uniform (pseudo-random numbers). Outside a computational setting, producing a uniform distribution is fairly easy (for example, rolling a fair die repeatedly to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way; those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
For an ideal situation, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period is equal to m-1. In practice, it has been found by a paper published in 1988 by Park and Miller, that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.<br />
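The recurrence and both worked examples above can be reproduced with a few lines of Python. This is a minimal sketch; the function name <code>lcg</code> and its argument order are illustrative choices, not part of any library.<br />

```python
def lcg(a, b, m, seed, n):
    """Return [x_0, x_1, ..., x_n] with x_{i+1} = (a * x_i + b) mod m."""
    xs = [seed]
    for _ in range(n):
        xs.append((a * xs[-1] + b) % m)
    return xs

# First worked example: a = 13, b = 0, m = 31, x_0 = 1.
good = lcg(13, 0, 31, 1, 30)
print(good[:4])             # starts 1, 13, 14, 27 as computed above
print(len(set(good[:30])))  # number of distinct values before the sequence repeats

# Second worked example: a = 3, b = 2, m = 4, x_0 = 1 gets stuck immediately.
bad = lcg(3, 2, 4, 1, 3)
print(bad)

# Park and Miller's parameters, scaled to (0, 1) by dividing by m.
m = 2 ** 31 - 1
park_miller = lcg(7 ** 5, 0, m, 1, 5)
uniforms = [x / m for x in park_miller[1:]]
print(uniforms)
```

The first generator visits every value from 1 to 30 before returning to the seed, while the second never leaves its starting value.<br />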
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is passed through the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then the cdf of the random variable X is F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is non-decreasing: for a < b, the event that X is at most a is contained in the event that X is at most b, so F(a) is at most F(b).<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution with rate <math>\theta</math>).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
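:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also Unif[0, 1] whenever <math>U</math> is, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />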
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below: generate <math>u_i</math>, calculate <math>x_i</math> using the above formula with <math>\theta=1</math> (the code uses the equivalent form <math>-\log(1-u_i)</math>), and plot the histogram of the <math>x_i</math> for <math>i=1,...,100000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
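<br />
As a further illustration of the two-step procedure, here is a minimal sketch for a second target. It assumes the density <math>f(x) = 2x</math> on [0, 1], chosen here only for illustration, whose cdf is <math>F(x) = x^2</math> and hence <math>F^{-1}(u) = \sqrt{u}</math>. The sample size is an arbitrary choice.<br />

```matlab
% Inverse transform sampling for f(x) = 2x on [0,1] (illustrative target).
N = 100000;             % sample size (arbitrary choice)
u = rand(N,1);          % Step 1: U ~ Unif[0,1]
x = sqrt(u);            % Step 2: X = F^{-1}(U), with F(x) = x^2
sample_mean = mean(x);  % theoretical mean is 2/3
hist(x)
```

The histogram should rise roughly linearly from 0 to 1, and the sample mean should be close to 2/3.<br />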
<br />
<br />
<br />
The major drawback of this approach is that it requires <math>F^{-1}</math>, and for many distributions the inverse of <math>F(x)</math> is difficult or impossible to obtain in closed form. For some distributions even <math>F(x)</math> itself has no closed form expression.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (less-than-or-equal-to signs are used for the upper bounds, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
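p=[0.3,0.2,0.5];            % P(X=0), P(X=1), P(X=2)<br />
x=zeros(1,1000);            % preallocate the sample<br />
for i=1:1000<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
elseif u <= sum(p(1:2))     % cumulative probability 0.5<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />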
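<br />
The chain of comparisons generalises to any finite pmf by searching the cumulative sums. The following is a sketch under that reading of the procedure; the variable names (<code>vals</code>, <code>cum_p</code>, <code>idx</code>) are illustrative and not from the lecture.<br />

```matlab
% Discrete inverse transform for an arbitrary finite pmf.
vals  = [0 1 2];           % support x_0,...,x_n (values from the example)
p     = [0.3 0.2 0.5];     % probabilities p_0,...,p_n
N     = 1000;              % sample size
cum_p = cumsum(p);         % cumulative probabilities F(x_k)
cum_p(end) = 1;            % guard against rounding in the last entry
u   = rand(N,1);
idx = zeros(N,1);
for i = 1:N
    idx(i) = find(u(i) <= cum_p, 1, 'first');  % smallest k with u <= F(x_k)
end
x = vals(idx);
freq = [mean(x==0), mean(x==1), mean(x==2)];   % should be close to p
```

Only <code>vals</code> and <code>p</code> need to change to sample from a different discrete distribution.<br />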
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant c such that<br /><br /><br />
<math> c \cdot g(x) \geq f(x)\ \forall x,</math><br />
<br /><br /><br />
we can draw candidates <math>x</math> in succession from <math>g(x)</math> and accept each one with probability<br />
<br /><br /><br />
<math> \frac {f(x)}{c \cdot g(x)}. </math><br />
<br /><br /><br />
The accepted candidates form a sample that follows the target distribution <math>f(x)</math>. A candidate is likely to be accepted where the ratio is close to 1 and likely to be rejected where it is close to 0.<br />
<br />
<br />
<br />
For example, in the figure at x=7 the curve <math> c \cdot g(x)</math> is close to <math>f(x)</math>, so a candidate drawn there is accepted with probability close to 1.<br />
<br />
At x=9 the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is smaller, so a candidate drawn there is rejected more often.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
:<math>\begin{align}<br />
Pr(accept) &{}= \int Pr(accept|X) \times Pr(X)\, dx \\<br />
&{}= \int \frac{f(x)}{c \cdot g(x)} g(x)\, dx \\<br />
&{}= \frac{1}{c} \int f(x)\, dx = \frac{1}{c}<br />
\end{align}</math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall that the Beta(2,1) density is:<br />
:<math> f(x) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^{2-1}(1-x)^{1-1} = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max_x \left(\frac{f(x)}{g(x)}\right) = 2</math>. In general, <math>c = \max_x \frac{f(x)}{g(x)}</math> is the smallest constant that satisfies the condition, and it is the most efficient choice since the probability of acceptance is <math>1/c</math>.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
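n = 1000;          % number of accepted samples wanted<br />
x = zeros(1,n);<br />
j = 0;             % number of samples accepted so far<br />
while j < n<br />
y = rand;          % candidate Y ~ Unif[0,1]<br />
u = rand;<br />
if u <= y          % accept with probability f(y)/(c*g(y)) = y<br />
j = j + 1;<br />
x(j) = y;<br />
end<br />
end<br />
hist(x);<br />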
<br />
<br />
The histogram produced here follows the target density, <math>f(x) = 2x</math> on the interval <math> 0 \leq x \leq 1 </math>, i.e. the pdf of a Beta(2,1) distribution, as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
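<br />
The proof above gives an acceptance probability of <math>1/c</math>, which can be checked numerically. The following vectorised sketch assumes the same Beta(2,1) target and uniform proposal; the number of candidates <code>M</code> is an arbitrary choice.<br />

```matlab
% Vectorised acceptance-rejection for f(x) = 2x with g = Unif[0,1], c = 2.
M = 100000;               % number of candidates (arbitrary choice)
y = rand(M,1);            % candidates Y ~ g
u = rand(M,1);            % uniforms for the accept/reject decision
accept = (u <= y);        % f(y)/(c*g(y)) = 2y/2 = y
x = y(accept);            % accepted sample
acc_rate = mean(accept);  % should be close to 1/c = 0.5
sample_mean = mean(x);    % Beta(2,1) has mean 2/3
```

About half of the candidates are kept, so roughly <math>c = 2</math> candidates are needed per accepted value.<br />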
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
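<br />
These steps can be sketched in Matlab as follows. This is an illustrative implementation, not code from the lecture; the number of accepted samples <code>n</code> is an assumption.<br />

```matlab
% Acceptance-rejection for the four-point pmf above.
f = [0.15 0.55 0.20 0.10];   % target pmf on {1,2,3,4}
g = 0.25;                    % discrete uniform proposal pmf
c = max(f)/g;                % 2.2
n = 10000;                   % accepted samples wanted (arbitrary)
x = zeros(1,n);
j = 0;                       % number accepted so far
while j < n
    y = ceil(4*rand);        % Y uniform on {1,2,3,4}
    u = rand;
    if u <= f(y)/(c*g)       % accept with probability f(Y)/(c*g(Y))
        j = j + 1;
        x(j) = y;
    end
end
freq = [mean(x==1) mean(x==2) mean(x==3) mean(x==4)];  % compare with f
```

The observed frequencies in <code>freq</code> should be close to the target probabilities.<br />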
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, c must be large, and since the probability of acceptance is 1/c we may have to reject a large number of points before a sample of the desired size is obtained. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
This suggests that to sample from the Gamma distribution, we can sample t independent exponential random variables with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can therefore combine independent uniform samples with the Inverse Transform Method seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Transform these into <math>t</math> independent exponential draws <math> x_1,\dots,x_t</math> with <math> x_i \sim~ Exp(\lambda)</math>, using the Inverse Transform Method: <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
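<br />
For a rate other than 1, the product form in step 3 can be used directly. The sketch below assumes <math>t = 5</math> and <math>\lambda = 2</math> (illustrative values) and compares the sample moments with the Gamma mean <math>t/\lambda</math> and variance <math>t/\lambda^2</math>.<br />

```matlab
% Gamma(t, lambda) sampling via the product form of the additive property.
t = 5;  lambda = 2;          % shape and rate (illustrative values)
N = 10000;                   % sample size (arbitrary choice)
U = rand(N,t);               % t independent uniforms per sample
X = -log(prod(U,2))/lambda;  % X = -(1/lambda)*log(u_1*...*u_t)
m_hat = mean(X);             % theoretical mean t/lambda = 2.5
v_hat = var(X);              % theoretical variance t/lambda^2 = 1.25
```

Taking one logarithm of the product, rather than t separate logarithms, gives the same values with fewer function evaluations.<br />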
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far: the inverse of F(Z) has no closed form, and although the Acceptance-Rejection method can be used, there are better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is its distance from the pole (the origin), and θ is the angle it forms with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, an Exponential and a Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
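<br />
The change of variables behind this result can be sketched as follows (a standard Jacobian computation, written out here for completeness). With <math>x = \sqrt{d}\cos\theta</math> and <math>y = \sqrt{d}\sin\theta</math>:<br />
:<math>\begin{align}
\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| &{}= \left|\det\begin{pmatrix} \frac{\cos\theta}{2\sqrt{d}} & -\sqrt{d}\sin\theta \\ \frac{\sin\theta}{2\sqrt{d}} & \sqrt{d}\cos\theta \end{pmatrix}\right| = \frac{1}{2}\cos^2\theta + \frac{1}{2}\sin^2\theta = \frac{1}{2} \\
f(x,y)\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| &{}= \frac{1}{2\pi}e^{-\frac{d}{2}} \cdot \frac{1}{2}
\end{align}</math><br />
The right-hand side is an exponential density with rate 1/2 in d multiplied by a uniform density on <math>[0, 2\pi]</math> in θ.<br />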
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
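<br />
Beyond histograms, the output can be checked numerically. The sketch below is an illustrative check, not part of the lecture; it compares sample means and variances with 0 and 1 and looks at the sample correlation of x and y. The sample size is an arbitrary choice.<br />

```matlab
% Numerical check of the Box-Muller output (illustrative).
N  = 100000;
u1 = rand(N,1);  u2 = rand(N,1);
r  = sqrt(-2*log(u1));       % r = sqrt(d), d ~ Exp(1/2)
theta = 2*pi*u2;             % theta ~ Unif[0,2*pi]
x = r.*cos(theta);
y = r.*sin(theta);
stats = [mean(x) var(x); mean(y) var(y)];  % each row should be close to [0 1]
C = corrcoef(x,y);           % off-diagonal entry should be close to 0
```

A correlation near zero is consistent with, though does not by itself prove, the independence of x and y.<br />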
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
The Box-Muller method samples from the standard normal distribution. The following properties of Normal distributions allow us to turn those samples into samples from any Normal distribution.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independent} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (stored as u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
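<br />
That both factorizations lead to the same distribution can be checked through the sample covariance. The sketch below is illustrative; the variable names and the sample size are assumptions.<br />

```matlab
% Compare the two matrix "square roots" by their sample covariances.
mu    = [-2; 3];
Sigma = [1 0.5; 0.5 1];
N = 100000;                       % sample size (arbitrary choice)
Z = randn(N,2);                   % rows are independent N(0,I) vectors
R = chol(Sigma);                  % upper triangular, R'*R = Sigma
S = sqrtm(Sigma);                 % symmetric, S*S = Sigma
X1 = ones(N,1)*mu' + Z*R;         % cov of rows of Z*R is R'*R = Sigma
X2 = ones(N,1)*mu' + Z*S;         % cov of rows of Z*S is S'*S = Sigma
C1 = cov(X1);  C2 = cov(X2);      % both should be close to Sigma
```

Any matrix A with A'*A equal to <math>\Sigma</math> can play the role of ss in <code>r*ss</code>, which is why the two plots look alike.<br />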
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[-\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2\right] </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta = \mu</math> is Gaussian (with <math>\sigma</math> known):<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
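with posterior variance <math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math>. The posterior mean is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>, so it generally differs from the MLE.<br />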
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
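<br />
The weighting of <math>\bar{x}</math> and <math>\mu_0</math> can be illustrated numerically. The sketch below uses simulated data; the values of <code>sigma</code>, <code>mu0</code>, <code>tau</code>, the true mean and the sample size are all assumptions made for the example.<br />

```matlab
% Posterior mean of mu for Gaussian data with known sigma and a Gaussian prior.
sigma = 2;  mu0 = 0;  tau = 1;            % known sd and prior parameters (assumed)
x = 5 + sigma*randn(50,1);                % simulated data, true mean 5 (illustrative)
N = length(x);
xbar = mean(x);                           % MLE of mu
w = (N/sigma^2)/(N/sigma^2 + 1/tau^2);    % weight on the sample mean
mu_post  = w*xbar + (1 - w)*mu0;          % posterior mean
var_post = 1/(N/sigma^2 + 1/tau^2);       % posterior variance
```

Here the weight on the sample mean is 12.5/13.5, so the posterior mean sits close to <math>\bar{x}</math> but is pulled slightly towards <math>\mu_0</math>.<br />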
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to simulating a random process, whose future evolution is described by probability distributions (the counterpart of a deterministic process); such simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim Unif(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x > 0</math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u=rand(100,1);<br />
x=-log(u);                    % inverse transform: x ~ Exp(1)<br />
h=x.^.5;                      % h(x) = sqrt(x)<br />
mean(h)                       % the value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)                  % numerical integration in Matlab; the exact value is sqrt(pi)/2, about 0.8862<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# The exact value is <math> \displaystyle I = \left[x^4/4\right]_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);<br />
mean(x.^3)<br />
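<br />
The standard error and confidence interval from the Process section can be added to this example. The sketch below is illustrative; it assumes N = 10000 and uses 1.96 as the approximate value of <math>Z_{0.025}</math>.<br />

```matlab
% Monte Carlo estimate of the integral of x^3 over [0,1], with SE and 95% CI.
a = 0;  b = 1;  N = 10000;       % N is an arbitrary choice
x = a + (b - a)*rand(N,1);       % x_i ~ Unif(a,b)
w = x.^3*(b - a);                % w(x) = h(x)(b-a)
I_hat = mean(w);                 % exact value is 1/4
SE = std(w)/sqrt(N);             % s/sqrt(N), where s uses the N-1 denominator
CI = [I_hat - 1.96*SE, I_hat + 1.96*SE];   % approximate 95% interval
```

Since SE shrinks like <math>1/\sqrt{N}</math>, halving the width of the interval requires about four times as many draws.<br />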
<br />
'''Example 3'''<br />
To estimate an improper integral<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dx=-\frac{dy}{y^2} </math>; as x runs from 5 to <math>\infty</math>, y runs from 1 to 0<br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
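<br />
This example can be checked against the exact value <math>3e^{-5}</math>. The following sketch is illustrative, with an arbitrary N, and uses the transformed integrand with a positive sign, <math>3e^{-(1/y+4)}/y^2</math>.<br />

```matlab
% Monte Carlo estimate of the integral of 3*exp(-x) over [5, Inf) via y = 1/(x-4).
N = 100000;                           % arbitrary choice
y = rand(N,1);                        % Y_i ~ Unif(0,1)
w = 3*exp(-(1./y + 4))./y.^2;         % transformed integrand on (0,1)
I_hat  = mean(w);
I_true = 3*exp(-5);                   % exact value, about 0.0202
rel_err = abs(I_hat - I_true)/I_true;
```

The substitution maps the infinite range onto (0, 1), so ordinary uniform draws suffice.<br />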
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial random variables with success probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution and the posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find the conditional distributions of <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of the probability 'p' of a Binomial is proportional to the likelihood <math>p^x(1-p)^{n-x}</math>, which is the kernel of a Beta distribution. Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{i=1}^{N}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for Y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta)<br />
<br />
In one run of this code, the mean was 0.1938.<br />
<br />
A 95% interval for <math>\delta</math> is given by the 2.5% and 97.5% quantiles of the simulated values, which cover 95% of the posterior distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\ \text{CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least a 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is a complicated function of the <math>\displaystyle p_i</math><br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We proceed as follows:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, \dots, 10</math><br />
# Keep sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard and try again.<br />
# Compute <math>\displaystyle \delta</math> by finding the first dose <math>\displaystyle x_i</math> whose sampled <math>\displaystyle p_i</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, at each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die, <math>Y_i</math>, is observed, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
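<br />
The process can be sketched in Matlab as follows. The dose levels and death counts below are entirely hypothetical (they are not the lecture data, which is not reproduced here), and the check that some <math>p_i</math> reaches 0.5 is an added assumption. As in the earlier example, <code>betarnd</code> requires the Statistics Toolbox.<br />

```matlab
% Monte Carlo for the dose-response problem (all data below are hypothetical).
x = 1:10;                          % dose levels x_1 < ... < x_10 (made-up units)
n = 10;                            % rats per dose
y = [0 1 2 3 4 5 6 7 8 9];         % made-up death counts, NOT the lecture data
N = 1000;                          % number of accepted draws wanted
delta = zeros(N,1);
k = 0;                             % number of draws kept so far
while k < N
    p = betarnd(y + 1, n - y + 1); % independent Beta posteriors, one per dose
    j = find(p >= 0.5, 1, 'first');
    if all(diff(p) >= 0) && ~isempty(j)   % keep only monotone draws that reach 0.5
        k = k + 1;
        delta(k) = x(j);           % first dose with death probability >= 0.5
    end
end
delta_bar = mean(delta);
```

Most draws are discarded by the ordering constraint, so the loop may need many more than N iterations.<br />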
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that every sampled value is kept and weighted, since certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the distribution we sample from and g(x) is the function to be integrated. Here we cast the integral <math>\int_{0}^1 g(x)dx</math> as an expectation with respect to U, such that <math>\int_{0}^1 g(x)dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), so we draw <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br><br />
<br />
'''Process'''<br><br />
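# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from and <math>\displaystyle g(x) > 0</math> wherever <math>\displaystyle h(x)f(x) \neq 0</math>.<br />
# Draw <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math>.<br />
# Compute <math>\displaystyle w(x_i)=\frac{h(x_i)f(x_i)}{g(x_i)}</math> for each sample.<br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />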
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because each value sampled from <math>\displaystyle g(x)</math> is given an 'importance' or 'weighting' according to how likely it is under the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left(\frac{f(x)}{g(x)}h(x)\right)</math> which is the same thing as saying that we are applying a regular Monte Carlo simulation to <math>\displaystyle\int h(x)g(x)\,dx </math>, but weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that values which are more likely under <math>\displaystyle f</math> than under <math>\displaystyle g</math> count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of the <math> h(x_i) </math> instead of a plain average.<br />
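The process above can be sketched in Python as follows. This is an illustration under assumptions rather than part of the lecture: the helper names are hypothetical, and the choices <math>h(x)=x^2</math>, <math>f = N(0,1)</math> and <math>g = N(0,2^2)</math> are made up so that the true answer (<math>I = 1</math>) is known.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def importance_sampling(h, f, g_pdf, g_sample, num_samples):
    """Average of w(x_i) = h(x_i) f(x_i) / g(x_i) over draws x_i from g."""
    total = 0.0
    for _ in range(num_samples):
        x = g_sample()                      # draw from the proposal g
        total += h(x) * f(x) / g_pdf(x)     # weight B(x) = f(x) / g(x) times h(x)
    return total / num_samples

rng = random.Random(1)
estimate = importance_sampling(
    h=lambda x: x * x,
    f=lambda x: normal_pdf(x, 0.0, 1.0),        # target density (assumed example)
    g_pdf=lambda x: normal_pdf(x, 0.0, 2.0),    # proposal density (assumed example)
    g_sample=lambda: rng.gauss(0.0, 2.0),
    num_samples=20000,
)
print(estimate)  # should be close to 1, the second moment of N(0,1)
```

The proposal here has a larger standard deviation than the target, so the weights <math>f/g</math> stay bounded.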
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math> We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen <math>\displaystyle g</math>, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> faster than <math>\displaystyle h^2(x)f^2(x) </math> does, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>: in that case <math>\frac{h^2(x)f^2(x)}{g(x)} </math> could be infinitely large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
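To make the tail condition concrete, the following Python sketch numerically integrates <math>\frac{f^2(x)}{g(x)}</math> (the second moment with <math>h(x)=1</math>) over growing intervals. The setup is an assumed example, not from the lecture: <math>f</math> is the standard normal density and <math>g</math> is a zero-mean normal density with standard deviation either 2 (thicker tail) or 0.5 (thinner tail); the function names are hypothetical.

```python
import math

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def truncated_second_moment(g_sd, limit, steps=20000):
    """Midpoint-rule approximation of the integral of f(x)^2 / g(x) over [-limit, limit]."""
    width = 2.0 * limit / steps
    total = 0.0
    for k in range(steps):
        x = -limit + (k + 0.5) * width
        total += normal_pdf(x, 1.0) ** 2 / normal_pdf(x, g_sd) * width
    return total

for limit in (3.0, 6.0, 10.0):
    thick = truncated_second_moment(2.0, limit)   # g has a thicker tail than f
    thin = truncated_second_moment(0.5, limit)    # g has a thinner tail than f
    print(limit, thick, thin)
```

With the thicker-tailed proposal the values settle near <math>4/\sqrt{7}</math>, while with the thinner-tailed proposal they keep growing as the interval widens, which is the infinite second moment described above.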
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and so cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
======Aside: Jensen's Inequality======<br />
If <math>\displaystyle \varphi </math> is a convex function (for example, twice differentiable with <math>\displaystyle \varphi''(x) \geq 0 </math>), then for any <math>\displaystyle \alpha \in [0,1]</math> we have <math>\displaystyle \varphi(\alpha x_1 + (1-\alpha)x_2) \leq \alpha \varphi(x_1) + (1-\alpha) \varphi(x_2)</math>. (We write <math>\displaystyle \varphi</math> rather than <math>\displaystyle g</math> to avoid confusion with the proposal density.)<br /><br />
Essentially, convexity means that the line segment between two points on the curve lies on or above the curve. This generalizes to any finite number of points:<br />
::<math>\displaystyle \varphi(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 \varphi(x_1) + \alpha_2 \varphi(x_2) + ... + \alpha_n \varphi(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle \varphi(E[x]) \leq E[\varphi(x)] </math> as <math>\displaystyle \varphi(E[x]) = \varphi(p_1 x_1 + ... + p_n x_n) \leq p_1 \varphi(x_1) + ... + p_n \varphi(x_n) = E[\varphi(x)] </math><br />
Therefore, taking <math>\displaystyle \varphi(t) = t^2</math> and applying the inequality to <math>\displaystyle \left| w(x) \right|</math>,<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
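Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math> that does not depend on <math>\displaystyle g</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E[(w(x))^2] = \int \frac{h^2(x)f^2(x)}{g(x)} dx</math>, we get <math>\displaystyle \int \left| h(x) \right| f(x) dx \cdot \int \left| h(s) \right| f(s) ds = \left(\int \left| h(x) \right| f(x) dx \right)^2</math>, so <math>\displaystyle g^*</math> attains the lower bound, as required.<br /><br />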
<br />
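However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />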
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
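A small Python sketch can illustrate the theorem in a case where <math>g^*</math> happens to be known. The setup is an assumed example, not from the lecture: <math>f(x) = e^{-x}</math> on <math>x > 0</math> and <math>h(x) = x</math>, so that <math>I = 1</math> and <math>g^*(x) = x e^{-x}</math> is the Gamma(2, 1) density. Under <math>g^*</math> every weight equals <math>I</math>, so the variance is zero.

```python
import math
import random
import statistics

rng = random.Random(2)
num_samples = 5000

def f(x):            # Exp(1) density (assumed example)
    return math.exp(-x)

def h(x):
    return x

def g_star(x):       # |h(x)| f(x) / integral of |h| f, i.e. the Gamma(2, 1) density
    return x * math.exp(-x)

# Weights under the optimal proposal g*: every w(x_i) equals I = 1.
draws_from_g_star = [rng.gammavariate(2.0, 1.0) for _ in range(num_samples)]
weights_optimal = [h(x) * f(x) / g_star(x) for x in draws_from_g_star]

# Weights when sampling from f itself (plain Monte Carlo): w(x_i) = h(x_i).
weights_plain = [h(rng.expovariate(1.0)) for _ in range(num_samples)]

print(statistics.pvariance(weights_optimal), statistics.pvariance(weights_plain))
```

This only works because the normalizing constant of <math>g^*</math> is known in this toy case; in general that constant is the quantity being estimated.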
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
'''Remark 4:'''<br />
:: <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_1,...,x_N \sim~ g</math><br />
Since <math>\displaystyle f</math> is a density, <math>\displaystyle \int f(x)\,dx = 1</math>, so we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)}{\frac{1}{N}\sum_{i=1}^{N} b(x_i)} = \frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
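The normalized-weight form of Remark 4 can be sketched in Python as below. The example is an assumption made for illustration: the target is the standard normal density with its constant <math>1/\sqrt{2\pi}</math> deliberately left out, <math>h(x) = x^2</math>, and the proposal is <math>N(0, 2^2)</math>; the function name is hypothetical. Any constant factor in <math>f</math> cancels between numerator and denominator, which is why the unnormalized density is enough.

```python
import math
import random

def self_normalized_is(h, f_unnormalized, g_pdf, g_sample, num_samples):
    """Return sum(h(x_i) b(x_i)) / sum(b(x_i)) with b = f_unnormalized / g and x_i ~ g."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(num_samples):
        x = g_sample()
        b = f_unnormalized(x) / g_pdf(x)
        numerator += h(x) * b
        denominator += b
    return numerator / denominator

rng = random.Random(3)
proposal_sd = 2.0
estimate = self_normalized_is(
    h=lambda x: x * x,
    # N(0,1) density without its 1/sqrt(2*pi) constant (assumed example)
    f_unnormalized=lambda x: math.exp(-0.5 * x * x),
    g_pdf=lambda x: math.exp(-0.5 * (x / proposal_sd) ** 2)
                    / (proposal_sd * math.sqrt(2.0 * math.pi)),
    g_sample=lambda: rng.gauss(0.0, proposal_sd),
    num_samples=20000,
)
print(estimate)  # should be close to 1 even though f is unnormalized
```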
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.00135 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_i \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and count the proportion of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo, since this is a low probability event (a tail event), and different runs may give significantly different values. For example, if we generate 1000 samples, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
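This example can be illustrated in Matlab using the code below (each of the 100 runs generates 100 samples):<br />
<br />
 format long<br />
 % Repeat the experiment 100 times; each run uses 100 standard normal draws<br />
 for i = 1:100<br />
   a(i) = sum(randn(100,1)>=3)/100;  % fraction of draws that are >= 3<br />
 end<br />
 meanMC = mean(a)   % average of the 100 estimates<br />
 varMC = var(a)     % variance of the estimates across runs<br />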
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 \times 10^{-5} </math>.<br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x)</math> be the density of <math>N(4,1) </math>, so that<br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
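This example can be illustrated in Matlab using the code below:<br />
 N = 100;                      % samples per run<br />
 for j = 1:100                 % 100 independent runs<br />
   x = randn(N,1) + 4;         % draw from the proposal g = N(4,1)<br />
   for ii = 1:N<br />
     h = x(ii)>=3;             % indicator of the event {x >= 3}<br />
     b = exp(8-4*x(ii));       % weight f(x)/g(x)<br />
     w(ii) = h*b;<br />
   end<br />
   I(j) = sum(w)/N;            % importance sampling estimate for this run<br />
 end<br />
 MEAN = mean(I)<br />
 VAR = var(I)<br />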
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and much less than the variability observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
That is,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Suppose that the direction of each step is decided in a probabilistic way: the probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
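As a sketch, the transition matrix of this random walk can be built in Python as follows. It follows the verbal description above (step right with probability <math>p</math>, step left with probability <math>q</math>, stop at either end); the function name and the choice of 5 states with <math>p = 0.3</math> are assumptions made for illustration.

```python
def random_walk_matrix(num_states, p):
    """Transition matrix of a random walk on states 0..num_states-1 that stops at both ends."""
    q = 1.0 - p
    P = [[0.0] * num_states for _ in range(num_states)]
    P[0][0] = 1.0                              # first state: once reached, we stop
    P[num_states - 1][num_states - 1] = 1.0    # last state: once reached, we stop
    for i in range(1, num_states - 1):
        P[i][i + 1] = p                        # step to the right
        P[i][i - 1] = q                        # step to the left
    return P

P = random_walk_matrix(5, 0.3)
for row in P:
    print(row)
```

Each row is a probability distribution over the next state, so each row sums to 1.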
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of values <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that, as in Monte Carlo integration, an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> can be estimated by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1 over <math>j</math>, as each row must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_N)]</math> <br /><br />
where <math>x_1,...,x_N</math> is an ordered list of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{iN}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(N)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
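The two results above can be checked numerically with a short Python sketch. The two-state transition matrix and the starting distribution are hypothetical values chosen for illustration, and the helper names are not from the lecture.

```python
def vec_mat(mu, P):
    """Row vector times matrix: (mu P)_j = sum_i mu_i P_ij."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]

def mat_mul(A, B):
    return [vec_mat(row, B) for row in A]

def mat_pow(P, n):
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

P = [[0.9, 0.1],
     [0.5, 0.5]]          # hypothetical two-state chain
mu0 = [1.0, 0.0]          # start in the first state

mu_step = mu0
for _ in range(5):
    mu_step = vec_mat(mu_step, P)        # mu_n = mu_{n-1} P, applied five times

mu_power = vec_mat(mu0, mat_pow(P, 5))   # mu_0 P^5
print(mu_step, mu_power)
```

Both routes give the same marginal distribution of <math>X_5</math>, up to floating point rounding.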
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an NxN matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' also NxN with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
----<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that the state space of a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible, and it reduces to a single class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that from states 1 and 2 the chain can only move to states 1 and 2 (the only nonzero entries in rows 1 and 2 are <math>\displaystyle P_{11}, P_{12}, P_{21}</math> and <math>\displaystyle P_{22}</math>)<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class with only one element<br />
::<br />
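The decomposition above can be reproduced with a short Python sketch that computes which states reach each other and groups the ones that communicate. The function name is hypothetical, and the use of Warshall's transitive closure is one possible approach, not necessarily the one intended in the lecture.

```python
def communicating_classes(P):
    """Group states (0-based) into classes of states that reach each other."""
    size = len(P)
    # reach[i][j] is True when j can be reached from i in zero or more steps.
    reach = [[i == j or P[i][j] > 0 for j in range(size)] for i in range(size)]
    for k in range(size):                      # Warshall's transitive closure
        for i in range(size):
            for j in range(size):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    classes = []
    assigned = set()
    for i in range(size):
        if i in assigned:
            continue
        members = [j for j in range(size) if reach[i][j] and reach[j][i]]
        classes.append(members)
        assigned.update(members)
    return classes

P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]
classes = [[state + 1 for state in cls] for cls in communicating_classes(P)]
print(classes)  # [[1, 2], [3], [4]]
```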
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability of returning to state i, starting from state i, is 1.<br />
<br />
If this is not the case (i.e. the probability is less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, states 1 and 2 form a closed class, so they are both recurrent. State 4 is an absorbing state, so it is also recurrent. State 3 is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. If we take n steps to the right, the only way to be back at <math>\displaystyle 0</math> is to have also taken n steps to the left, so a return to <math>\displaystyle 0</math> can only happen after an even number of steps:<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)={2n \choose n} p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim~n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability could therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim~\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And, since <math>\displaystyle P_{00}(n)=0</math> for odd <math>n</math>, the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> (i.e. <math>p \neq \frac{1}{2}</math>) the sum converges and the state is transient; otherwise (<math>\displaystyle 4pq = 1</math>, i.e. <math>p = \frac{1}{2}</math>) the sum diverges and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n) \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
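This sum is finite when <math>pq < \frac{1}{4}</math> (i.e. <math>p \neq \frac{1}{2}</math>) and diverges as <math>pq \rightarrow \frac{1}{4}</math>, so we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />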
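The two cases can be compared numerically with the Python sketch below, which adds up <math>P_{00}(2n) = \binom{2n}{n}p^nq^n</math> term by term. The function name and the particular values <math>p = 0.6</math> and <math>p = 0.5</math> are chosen for illustration only.

```python
import math

def partial_sum_return_probabilities(p, num_terms):
    """Sum of P_00(2n) = C(2n, n) p^n q^n for n = 0, ..., num_terms."""
    pq = p * (1.0 - p)
    term = 1.0          # n = 0 term
    total = term
    for n in range(num_terms):
        # C(2n+2, n+1) / C(2n, n) = 2 (2n + 1) / (n + 1)
        term *= 2.0 * (2 * n + 1) / (n + 1) * pq
        total += term
    return total

asymmetric = partial_sum_return_probabilities(0.6, 2000)
closed_form = 1.0 / math.sqrt(1.0 - 4.0 * 0.6 * 0.4)
symmetric_1000 = partial_sum_return_probabilities(0.5, 1000)
symmetric_10000 = partial_sum_return_probabilities(0.5, 10000)
print(asymmetric, closed_form, symmetric_1000, symmetric_10000)
```

For <math>p = 0.6</math> the partial sums settle at the closed-form value <math>1/\sqrt{1-4pq} = 5</math>, while for <math>p = 0.5</math> they keep growing roughly like <math>2\sqrt{n/\pi}</math>.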
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i, in which case <math>\displaystyle T_{ij}=\infty</math>.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n \geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math> given <math>\displaystyle x_{0}=i</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> with period <math>\displaystyle d > 1</math> if, starting from any state, a return to that state is only possible after a multiple of <math>\displaystyle d</math> steps.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: starting from state 1, we can only get back to state 1 after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
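The periods of the two example chains can be checked with the Python sketch below. It uses the usual definition of the period of a state as the greatest common divisor of the step counts at which a return is possible; this gcd formulation, the function names and the cutoff of 50 steps are assumptions that go slightly beyond the informal definition given in these notes.

```python
from math import gcd

def mat_mul(A, B):
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def period(P, state, max_steps=50):
    """gcd of all n <= max_steps with P^n[state][state] > 0 (0 if no return is seen)."""
    d = 0
    power = P
    for n in range(1, max_steps + 1):
        if power[state][state] > 0:
            d = gcd(d, n)
        power = mat_mul(power, P)
    return d

cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
alternating = [[0, 0.5, 0, 0.5],
               [0.5, 0, 0.5, 0],
               [0, 0.5, 0, 0.5],
               [0.5, 0, 0.5, 0]]
print(period(cycle, 0), period(alternating, 0))  # 3 2
```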
<br />
<br />
===Markov Chains and their Stationary Distributions - June 16, 2009===<br />
====New Definition: Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
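A quick numerical illustration of the limiting distribution, as a Python sketch: the two-state chain below is a hypothetical example (not from the lecture) whose stationary distribution works out to <math>(5/6, 1/6)</math>. Raising <math>P</math> to a high power by repeated squaring shows every row approaching that vector.

```python
def mat_mul(A, B):
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

P = [[0.9, 0.1],
     [0.5, 0.5]]               # hypothetical ergodic two-state chain

power = P
for _ in range(6):             # repeated squaring: after the loop, power = P^64
    power = mat_mul(power, power)

pi = power[0]                  # every row of P^n approaches the same vector pi
pi_times_P = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]
print(power)
print(pi, pi_times_P)
```

The last line prints two nearly identical vectors, which is the stationarity condition <math>\pi = \pi P</math>.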
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math> for all <math>i</math> and <math>j</math>.<br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
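'''Proof.'''<br />
We need to show <math>\pi = \pi P</math>. For each <math>j</math>,<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required, where the second equality uses detailed balance and the last one uses the fact that each row of P sums to 1.<br />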
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
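<br />
Both claims can be checked with a few lines of Python. This is a minimal sketch; the helper <code>vec_mat</code> and the choice of starting in state 1 are illustrative assumptions:<br />

```python
def vec_mat(v, P):
    """Row vector times matrix: returns v P."""
    n = len(P)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

pi = [1/3, 1/3, 1/3]
print(vec_mat(pi, P))      # pi P equals pi, so pi is stationary

# Starting in state 1, the distribution cycles with period 3 and never settles.
v = [1, 0, 0]
history = []
for n in range(6):
    history.append(v)
    v = vec_mat(v, P)
print(history)
```

The distribution after <math>n</math> steps keeps rotating through the three unit vectors, so <math>P^n</math> has no limit even though <math>\pi = \pi P</math> holds.<br />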
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \ 0 \ 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N \to \infty} \frac{1}{N}\displaystyle\sum_{n=1}^N g(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
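<br />
The result can be double-checked numerically by applying <math>\pi \leftarrow \pi P</math> repeatedly. A minimal Python sketch (the function name <code>stationary_by_iteration</code>, the number of steps, and the starting state are arbitrary illustrative choices):<br />

```python
def vec_mat(v, P):
    """Row vector times matrix: returns v P."""
    n = len(P)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary_by_iteration(P, n_steps=200):
    """Approximate the limiting distribution by repeatedly applying pi <- pi P."""
    pi = [1.0] + [0.0] * (len(P) - 1)     # start in the first state
    for _ in range(n_steps):
        pi = vec_mat(pi, P)
    return pi

P = [[1/2, 1/2, 0],
     [1/2, 1/4, 1/4],
     [0, 1/3, 2/3]]

pi = stationary_by_iteration(P)
print(pi)                                 # compare with (4/11, 4/11, 3/11)
```

Because this chain is irreducible and aperiodic, the iteration converges to the same vector from any starting state.<br />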
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "proposal distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>, the starting point of the chain (it can be chosen randomly), and set index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
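<br />
The six steps above can be sketched as a generic Python function. This is an illustration under stated assumptions, not the lecture's code: the names <code>metropolis_hastings</code>, <code>q_sample</code> and <code>q_density</code> are made up here, and the usage example targets an unnormalized standard normal with an <math>N(x, b^2)</math> proposal purely for demonstration.<br />

```python
import math
import random

def metropolis_hastings(f, q_sample, q_density, x0, n_steps, rng):
    """Run Metropolis-Hastings for a target density f (known up to a constant).

    q_sample(x, rng) draws Y ~ q(. | x); q_density(y, x) evaluates q(y | x).
    """
    x = x0                                   # step 1: starting point
    chain = [x]
    for _ in range(n_steps):
        y = q_sample(x, rng)                 # step 2: propose Y ~ q(y | X_i)
        ratio = (f(y) / f(x)) * (q_density(x, y) / q_density(y, x))
        r = min(ratio, 1.0)                  # step 3: acceptance probability
        u = rng.random()                     # step 4: U ~ Unif[0, 1]
        if u < r:                            # step 5: accept, otherwise stay
            x = y
        chain.append(x)                      # step 6: move to the next index
    return chain

# Illustrative target: standard normal, up to its normalizing constant.
def target(x):
    return math.exp(-0.5 * x * x)

b = 1.0                                      # proposal standard deviation

def q_sample(x, rng):
    return rng.gauss(x, b)

def q_density(y, x):
    return math.exp(-0.5 * ((y - x) / b) ** 2) / (b * math.sqrt(2 * math.pi))

chain = metropolis_hastings(target, q_sample, q_density, 0.0, 20000, random.Random(0))
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(mean, var)                             # should be near 0 and 1
```

Only ratios of <math>f</math> appear, so the normalizing constant of the target is never needed.<br />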
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math>, where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of the Metropolis Hastings Algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the density of proposing y from a normal centered at x is equal to the density of proposing x from a normal centered at y.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then most proposals <math>\displaystyle Y</math> land far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is usually very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well.<br />
<br />
::It is then highly unlikely that <math>\displaystyle u<r</math>: the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b avoids the issues mentioned above, and we say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
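<br />
The effect of b can also be seen without plots, by counting how often proposals are accepted. The Python sketch below re-implements the Cauchy example for this purpose only; the function name <code>acceptance_rate</code>, the seed, and the number of steps are illustrative assumptions.<br />

```python
import random

def acceptance_rate(b, n_steps=5000, seed=0):
    """Fraction of accepted proposals for the Cauchy target with an N(x, b^2) proposal."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)                 # random starting point
    accepted = 0
    for _ in range(n_steps):
        y = x + b * rng.gauss(0.0, 1.0)     # propose Y ~ N(x, b^2)
        r = min((1 + x * x) / (1 + y * y), 1.0)
        if rng.random() < r:
            x = y
            accepted += 1
    return accepted / n_steps

rates = {b: acceptance_rate(b) for b in (0.001, 2, 1000)}
for b, rate in rates.items():
    print(b, rate)
```

With b=1000 almost every proposal is rejected and the chain sits still; with b=0.001 almost every proposal is accepted, but each move is so small that the chain barely explores. A high acceptance rate on its own is therefore not a sign of good mixing.<br />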
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains.<br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the detailed balance condition as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>.<br />
<br />
<br />
We want to show that if detailed balance holds (i.e. <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>), then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
Starting from the right-hand side of the stationarity condition:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> by detailed balance<br />
:::<math>=f(x)\int P(x,y)dy</math><br />
:::<math>=\displaystyle f(x)</math> since <math>\int P(x,y)dy=1</math><br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where it equals 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>.<br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is the probability of jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math>, which happens if the proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is the probability of jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math>, which happens if the proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math> when the chain is at <math>\displaystyle x</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math> when the chain is at <math>\displaystyle y</math><br />
:<math>\displaystyle r(x,y)</math> is the probability of accepting <math>\displaystyle y</math> when the chain is at <math>\displaystyle x</math><br />
:<math>\displaystyle r(y,x)</math> is the probability of accepting <math>\displaystyle x</math> when the chain is at <math>\displaystyle y</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired, which shows that Metropolis-Hastings satisfies detailed balance.<br />
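<br />
On a finite state space the same argument can be verified numerically. The Python sketch below builds the Metropolis-Hastings transition matrix for a small made-up example and checks detailed balance entry by entry; the weights <code>f</code>, the non-symmetric proposal matrix <code>q</code> and the function name <code>mh_transition_matrix</code> are illustrative assumptions.<br />

```python
def mh_transition_matrix(f, q):
    """Transition matrix of Metropolis-Hastings on a finite state space.

    f[i] is the (possibly unnormalized) target weight of state i and
    q[i][j] is the probability of proposing state j from state i.
    """
    n = len(f)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and q[i][j] > 0:
                r = min(f[j] * q[j][i] / (f[i] * q[i][j]), 1.0)
                P[i][j] = q[i][j] * r        # propose j, then accept it
        P[i][i] = 1.0 - sum(P[i])            # rejected proposals stay at i
    return P

f = [1.0, 2.0, 3.0, 4.0]
q = [[0.1, 0.4, 0.3, 0.2],
     [0.3, 0.1, 0.2, 0.4],
     [0.25, 0.25, 0.25, 0.25],
     [0.4, 0.3, 0.2, 0.1]]

P = mh_transition_matrix(f, q)
gap = max(abs(f[i] * P[i][j] - f[j] * P[j][i])
          for i in range(4) for j in range(4))
print(gap)                                   # zero up to rounding error
```

Each off-diagonal product <math>f_i P_{ij}</math> equals <math>\min(f_i q_{ij}, f_j q_{ji})</math>, which is symmetric in <math>i</math> and <math>j</math>; this is the discrete analogue of the two equations above.<br />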
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to help sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), whose 1953 algorithm also underlies the Simulated Annealing method (that is introduced in this lecture as well). The Gibbs sampling algorithm, that will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. It is a more efficient method, although less applicable at times.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This holds trivially for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed).<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. Accept <math>\displaystyle Y</math> with probability <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In MATLAB, we defined this as<br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math> (it is usually used when <math>\displaystyle f </math> is known up to a proportionality constant - the form of the distribution is known, but the distribution is not exactly known).<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence!<br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability r} \\<br />
x_i, & \text{with probability 1-r} \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because the acceptance probability <math>\displaystyle r </math> depends on the current state <math>\displaystyle X_i </math><br />
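<br />
A small Python sketch makes the dependence visible. It is an illustration only: the target (an unnormalized standard normal), the fixed proposal <math>N(0, 2^2)</math>, and the names <code>independence_mh</code>, <code>q_sample</code> and <code>q_density</code> are assumptions chosen for this example.<br />

```python
import math
import random

def independence_mh(f, q_sample, q_density, x0, n_steps, rng):
    """Independence Metropolis-Hastings: the proposal q(y) ignores the current state."""
    x = x0
    chain = [x]
    for _ in range(n_steps):
        y = q_sample(rng)                    # Y ~ q(y), drawn without looking at x
        r = min((f(y) * q_density(x)) / (f(x) * q_density(y)), 1.0)
        if rng.random() < r:                 # the acceptance step still depends on x
            x = y
        chain.append(x)
    return chain

def target(x):
    return math.exp(-0.5 * x * x)            # standard normal, up to a constant

def q_sample(rng):
    return rng.gauss(0.0, 2.0)               # fixed, wider proposal N(0, 2^2)

def q_density(x):
    return math.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * math.sqrt(2 * math.pi))

chain = independence_mh(target, q_sample, q_density, 0.0, 20000, random.Random(1))
repeats = sum(1 for a, b in zip(chain, chain[1:]) if a == b)
print(repeats / (len(chain) - 1))            # fraction of steps where the chain stayed put
```

Every rejection repeats the previous value, so consecutive states are equal with positive probability; an independent sequence of continuous draws would essentially never do that.<br />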
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. This is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant <math>T > 0</math> (since the exponential function is monotone increasing).<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, set <math>\displaystyle i = i+1</math>, and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, which is close to 1 when T is large, so we accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math> but still fairly often (the chain can move uphill and explore)<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore almost surely reject <math>\displaystyle y </math><br />
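<br />
The two regimes can be tabulated directly. In the Python sketch below, the function name <code>acceptance_probability</code> and the test values (an uphill move that increases <math>h</math> by 1) are illustrative assumptions:<br />

```python
import math

def acceptance_probability(h_x, h_y, T):
    """r = min(exp((h(x) - h(y)) / T), 1) for a symmetric proposal."""
    if h_y <= h_x:
        return 1.0                        # downhill (or equal) moves are always accepted
    return math.exp((h_x - h_y) / T)      # uphill moves become rarer as T shrinks

for T in (100, 1, 0.01):
    print(T, acceptance_probability(0.0, 1.0, T))
```

For this uphill move the acceptance probability is roughly 0.99 at T=100, roughly 0.37 at T=1, and practically zero at T=0.01. Handling the downhill case separately also avoids evaluating a huge positive exponent when T is tiny.<br />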
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for large T and contracts for small T, i.e. T plays the role of a variance, making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
In the end, when T is pretty small, the distribution that we're sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
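% --- obj.m (the objective function, saved in its own file) ---<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
% --- main script ---<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2; %standard deviation of the normal proposal<br />
X(1) = 0; %starting point of the chain<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); %propose Y ~ N(X(i-1), b^2)<br />
r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) ); %same as exp(-obj(Y)/T)/exp(-obj(X(i-1))/T), but avoids 0/0 when T is small<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />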
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
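<br />
The whole run, including the decreasing temperatures, can be sketched in Python as a single loop. This is an illustration rather than the code used in the lecture: the function name <code>simulated_annealing</code>, the 100 steps per temperature, and the seed are assumptions.<br />

```python
import math
import random

def simulated_annealing(h, x0, temperatures, steps_per_temp, b, rng):
    """Minimize h with a normal random-walk proposal N(x, b^2) and a cooling schedule."""
    x = x0
    path = [x]
    for T in temperatures:                       # T decreases from stage to stage
        for _ in range(steps_per_temp):
            y = x + b * rng.gauss(0.0, 1.0)      # propose Y ~ N(x, b^2)
            delta = h(y) - h(x)
            # r = min(exp(-delta / T), 1); downhill moves are always accepted
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                x = y
            path.append(x)
    return path

temperatures = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
path = simulated_annealing(lambda x: (x - 2) ** 2, 0.0, temperatures, 100, 2.0,
                           random.Random(0))
print(path[-1])                                  # should end up close to 2
```

Early stages let the chain wander widely; by the last stages uphill moves are almost never accepted, so the chain settles near the minimizer <math>x = 2</math>.<br />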
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the Metropolis-Hastings proof.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method can be impractical when the random variables are strongly correlated, because each conditional update then moves the chain only a short distance.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution will be sampled better using only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; % mean vector (mu_1, mu_2)<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2); % conditional standard deviation<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); % conditional mean of x_1 given the previous x_2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); % conditional mean of x_2 given the updated x_1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -><br />
% this demonstrates that convergence occurs after a while<br />
% (this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
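<br />
The same sampler can be sketched in Python, here with a numerical check that the retained draws reproduce the target mean and correlation. The names <code>gibbs_bivariate_normal</code> and <code>correlation</code>, the value <math>\rho = 0.5</math>, the seed, and the burn-in length of 1000 are illustrative assumptions.<br />

```python
import math
import random

def gibbs_bivariate_normal(mu, rho, n_steps, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a bivariate normal with unit variances and correlation rho."""
    sd = math.sqrt(1.0 - rho ** 2)               # conditional standard deviation
    x1, x2 = start
    samples = []
    for _ in range(n_steps):
        x1 = rng.gauss(mu[0] + rho * (x2 - mu[1]), sd)   # draw X1 | X2
        x2 = rng.gauss(mu[1] + rho * (x1 - mu[0]), sd)   # draw X2 | updated X1
        samples.append((x1, x2))
    return samples

def correlation(pairs):
    """Sample correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)

samples = gibbs_bivariate_normal((1.0, 2.0), 0.5, 20000, random.Random(0))
kept = samples[1000:]                            # discard the burn-in draws
m1 = sum(p[0] for p in kept) / len(kept)
m2 = sum(p[1] for p in kept) / len(kept)
print(m1, m2, correlation(kept))                 # should be near 1, 2 and 0.5
```

Each new coordinate is drawn conditional on the most recently updated value of the other coordinate, which is the requirement stated in the procedure above.<br />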
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
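<br />
The procedure above can be sketched in Python as follows. This is a hedged illustration: both proposals are taken to be symmetric normal random walks (so the <math>q</math> ratios equal 1 and drop out), the target is an unnormalized zero-mean bivariate normal with correlation 0.5, and the names <code>mh_within_gibbs</code>, <code>step</code> and <code>target</code> are assumptions made for this example.<br />

```python
import math
import random

def mh_within_gibbs(f, x0, y0, n_steps, step, rng):
    """Alternate one Metropolis step for X (Y fixed) and one for Y (X fixed)."""
    x, y = x0, y0
    samples = []
    for _ in range(n_steps):
        z = x + step * rng.gauss(0.0, 1.0)       # propose a new X, keeping Y fixed
        if rng.random() < min(f(z, y) / f(x, y), 1.0):
            x = z
        z = y + step * rng.gauss(0.0, 1.0)       # propose a new Y, using the updated X
        if rng.random() < min(f(x, z) / f(x, y), 1.0):
            y = z
        samples.append((x, y))
    return samples

rho = 0.5

def target(x, y):
    """Zero-mean bivariate normal with correlation rho, up to a constant."""
    return math.exp(-(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2)))

samples = mh_within_gibbs(target, 0.0, 0.0, 30000, 1.0, random.Random(2))
kept = samples[1000:]                            # discard the burn-in draws
mean_x = sum(p[0] for p in kept) / len(kept)
mean_y = sum(p[1] for p in kept) / len(kept)
print(mean_x, mean_y)                            # both should be near 0
```

No conditional distribution is sampled directly here; only the joint density <math>f</math>, known up to a constant, is evaluated.<br />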
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} \text{ where } 0 \leq d \leq 1 </math><br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we are free to fix its scale. Summing the rank equation over all pages shows that, when every page has at least one outgoing link and <math>\displaystyle d < 1</math>, the ranks satisfy <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
(in matrix form this is <math>\displaystyle e^T \times P = N</math>, i.e. the average rank is 1). Using this, we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times \frac{e^T \times P}{N} + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = \left[\frac{1-d}{N} \times e \times e^T + d \times L \times D^{-1}\right] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply (here the subscript indexes the iteration, not the page)<br />
<br />
<math>\displaystyle P_n \leftarrow A \times P_{n-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n we have <math>\displaystyle P_n \approx P</math>.<br />
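<br />
The iteration can be sketched in Matlab as follows. This is a minimal illustration rather than code from the lecture: the four-page link matrix <code>L</code>, the value <code>d = 0.85</code> and the number of iterations are assumptions made for the example. Summing the rank equation over all pages forces the ranks to sum to N when every page has an outgoing link, so the sketch divides the constant term of A by N and starts from a guess whose entries sum to N.<br />

```matlab
% Hypothetical 4-page web: L(i,j) = 1 if page j links to page i.
L = [0 0 1 1;
     1 0 0 0;
     1 1 0 1;
     0 1 1 0];
N = size(L, 1);
d = 0.85;                      % assumed coefficient, 0 < d < 1
c = sum(L, 1);                 % c(j) = number of outgoing links from page j
Dinv = diag(1 ./ c);           % requires every page to have an outgoing link
e = ones(N, 1);

% A is column-stochastic: every column sums to 1.
A = (1 - d) * (e * e') / N + d * L * Dinv;

P = e;                         % initial guess whose entries sum to N
for n = 1:200
    P = A * P;                 % P_n <- A * P_{n-1}
end

% At convergence P satisfies the rank equation P = (1-d)*e + d*L*Dinv*P.
residual = norm(P - ((1 - d) * e + d * L * Dinv * P));
disp(P');
disp(residual);
```

Any page with no outgoing links would make <code>1 ./ c</code> infinite, so a real implementation needs a rule for such pages; that case is outside this sketch.<br />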
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
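<br />
These definitions map directly onto Matlab operations. The short sketch below uses two example vectors chosen only for illustration.<br />

```matlab
u = [3; 4];
v = [4; -3];
ip = u' * v;                               % inner product: 3*4 + 4*(-3)
len_u = sqrt(u' * u);                      % length of u, same as norm(u)
dist_uv = norm(u - v);                     % distance: length of u - v
theta = acos(ip / (norm(u) * norm(v)));    % angle between u and v, in radians
```

For these particular vectors the inner product is zero, so the angle comes out as a right angle.<br />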
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
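<br />
Linear dependence can be checked numerically. The sketch below is an illustration with an assumed set of three vectors in <math>\mathbf{R}^2</math> (so p > n); it uses Matlab's <code>rank</code> and <code>null</code> functions.<br />

```matlab
V = [1 0 1;
     0 1 1];          % three vectors in R^2, stored as the columns of V
r = rank(V);          % number of linearly independent columns
a = null(V);          % basis for the coefficient vectors with V * a = 0
check = norm(V * a);  % ~0: a gives a non-trivial solution
```

Because <code>r</code> is smaller than the number of columns, <code>null(V)</code> is non-empty and its column supplies coefficients, not all zero, that combine the vectors into the zero vector.<br />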
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity).<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the maximum number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Alternatively, a matrix is non-singular if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A matrix is said to be ''non-singular'' if and only if its rank equals the number of rows or columns. A non-singular matrix has a non-zero determinant.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent, <math>A^{\dagger} = (A^TA)^{-1}A^T</math> with <math>A^{\dagger}A = I</math>.<br />
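<br />
The pseudo-inverse formula can be checked on a small non-square matrix. In this sketch the 3-by-2 matrix is an assumed example with linearly independent columns, which is what makes <math>A^TA</math> invertible; the result is compared against Matlab's built-in <code>pinv</code>.<br />

```matlab
A = [1 0;
     0 1;
     1 1];                       % 3-by-2 with linearly independent columns
Adag = (A' * A) \ A';            % (A^T A)^{-1} A^T, computed without inv()
left = Adag * A;                 % should be the 2-by-2 identity
same = norm(Adag - pinv(A));     % ~0: agrees with the built-in pseudo-inverse
```

Note that <code>A * Adag</code> is 3-by-3 and is not the identity; only the left product recovers I here.<br />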
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive.<br />
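<br />
These properties can be verified numerically with Matlab's <code>eig</code>. The matrix below is an assumed example that is real, symmetric and positive-definite; the checks also confirm that the trace is the sum of the eigenvalues.<br />

```matlab
A = [2 1;
     1 2];                                 % real, symmetric, positive-definite
[V, Lam] = eig(A);                         % columns of V: eigenvectors; diag(Lam): eigenvalues
lambda = diag(Lam);                        % eigenvalues of this A are 1 and 3
check_eig = norm(A * V - V * Lam);         % ~0: A v = lambda v for each pair
check_orth = norm(V' * V - eye(2));        % ~0: eigenvectors are orthonormal
check_trace = abs(trace(A) - sum(lambda)); % ~0: trace = sum of eigenvalues
check_det = abs(det(A) - prod(lambda));    % ~0: det = product of eigenvalues
```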
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix.<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
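<br />
A 2-D rotation is a simple orthonormal transformation. The sketch below, with an assumed rotation angle and test vector, checks that <math>AA^T = I</math> and that the length of the vector is preserved.<br />

```matlab
theta = pi / 6;                            % assumed rotation angle
A = [cos(theta) -sin(theta);
     sin(theta)  cos(theta)];              % rotation of the plane by theta
x = [3; 4];
y = A * x;                                 % transformed vector
orth_gap = norm(A * A' - eye(2));          % ~0: A is orthonormal
len_gap = abs(norm(y) - norm(x));          % ~0: magnitude is preserved
```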
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (this is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
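<br />
The decorrelation property can be illustrated with a sample covariance matrix. The data below are an assumed deterministic toy set (not Gaussian), which is enough to show that the transformed coordinates are uncorrelated and that their variances are the eigenvalues.<br />

```matlab
% Assumed toy data: correlated 2-D points, one column per point.
t = linspace(0, 2 * pi, 200);
M = [2 1; 0 1];                  % assumed mixing matrix that introduces correlation
X = M * [cos(t); sin(3 * t)];
S = cov(X');                     % sample covariance (cov expects one row per point)
[U, Lam] = eig(S);               % columns of U: eigenvectors of S
Y = U' * X;                      % express the data in the eigenvector basis
SY = cov(Y');                    % covariance of the transformed data
offdiag = abs(SY(1, 2));         % ~0: the new coordinates are uncorrelated
gap = norm(diag(SY) - diag(Lam));    % ~0: their variances are the eigenvalues
```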
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any <math>c > 0</math>, so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> determined by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel or, equivalently, the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one gives the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
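<br />
For completeness, the system can be solved by hand; the following worked steps are added for illustration and use only the Lagrangian and constraint stated above. The first two conditions require <math>\displaystyle \lambda \neq 0</math> and together imply <math>\displaystyle y = -x</math>. Substituting into the constraint gives<br />
<br />
<math>\displaystyle x^2 + (-x)^2 = 1 \Rightarrow x = \pm \frac{\sqrt{2}}{2}, \quad y = \mp \frac{\sqrt{2}}{2}</math><br />
<br />
Evaluating the objective at the two points,<br />
<br />
<math>\displaystyle f\left(\tfrac{\sqrt{2}}{2},-\tfrac{\sqrt{2}}{2}\right) = \sqrt{2}, \qquad f\left(-\tfrac{\sqrt{2}}{2},\tfrac{\sqrt{2}}{2}\right) = -\sqrt{2}</math><br />
<br />
so the first point is the constrained maximum and the second is the constrained minimum.<br />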
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (<math> \textbf{w}^T \textbf{w} = 1 </math>) then the second part of the equation is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector then the second part of the equation is non-zero and penalizes the violation of the constraint. At the solution the constraint <math> \textbf{w}^T \textbf{w} =1 </math> holds.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
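<br />
The eigenvalue result and the variance decomposition can be checked numerically. This is a minimal sketch on an assumed deterministic toy data set with D = 3; it is not the course data, and the variable names are chosen only for the example.<br />

```matlab
% Assumed toy data: D = 3 variables, n = 200 points, one column per point.
t = linspace(0, 1, 200);
X = [3 * t + sin(7 * t);
     t .^ 2 - cos(11 * t);
     0.5 * sin(23 * t)];
S = cov(X');                               % D-by-D sample covariance
[W, Lam] = eig(S);
[lambda, order] = sort(diag(Lam), 'descend');   % lambda_1 >= ... >= lambda_D
W = W(:, order);
w1 = W(:, 1);                              % direction of the first principal component
u1 = w1' * X;                              % projections of the data onto w1
gap_var = abs(var(u1) - lambda(1));        % ~0: Var(u_1) = lambda_1
gap_trace = abs(sum(lambda) - trace(S));   % ~0: sum of eigenvalues = Tr(S)
gap_total = abs(trace(S) - sum(var(X, 0, 2)));  % ~0: Tr(S) = sum of Var(x_i)
```

The eigenvalues are sorted explicitly because <code>eig</code> does not promise descending order.<br />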
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')   % loads the data matrix X<br />
 who                                       % list the variables that were loaded<br />
 size(X)                                   % one column per image<br />
 imagesc(reshape(X(:,1),20,28)')           % display the first (noisy) image<br />
 colormap gray<br />
 [u, s, v] = svd(X);                       % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster it (i.e. leave the data unlabeled and look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
The Matlab code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs , scores ] = princomp (X') <br />
% performs principal components analysis on the data matrix X<br />
% returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
[[File:Plot1.jpg]]<br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
[[File:Plot2.jpg]]<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
% the 'ro' command makes them red Os on the plot<br />
[[File:Plot3.jpg]]<br />
 % If we classify based on the position in this plot (feature), <br />
 % it's easier than looking at each of the 64 pixels<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
[[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
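<br />
The encoding and reconstruction steps in the notes above can be sketched in Matlab. This is an illustration under stated assumptions, not the algorithm from the slide verbatim: the toy data set, the choice d = 2, and the use of <code>svd</code> on the centred data to obtain U are all choices made for the example.<br />

```matlab
% Assumed toy data: D = 3, n = 100, lying close to a 2-D plane.
t = linspace(0, 2 * pi, 100);
X = [cos(t); sin(t); 0.01 * cos(5 * t)];
n = size(X, 2);
mu = mean(X, 2);
Xc = X - repmat(mu, 1, n);                 % subtract the mean from every column
[U, Sv, V] = svd(Xc, 'econ');              % columns of U: principal directions
d = 2;
Ud = U(:, 1:d);                            % top d directions (D-by-d)
Y = Ud' * Xc;                              % encode: d-by-n
Xhat = Ud * Y + repmat(mu, 1, n);          % reconstruct: D-by-n, mean added back
err = norm(X - Xhat, 'fro') / norm(X, 'fro');   % relative reconstruction error
```

Only the low-variance third direction is discarded here, so the relative error stays small.<br />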
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to, depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance. That is, our ideal situation is that the individual classes are as far away from each other as possible, while the data within each class are close together (ideally collapsing to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2</math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar. Therefore,<br />
<br /><br /><br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the cyclic property of the trace, <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two class problem the between class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a well-known problem called the "generalized eigenvalue problem". We can solve it using Lagrange multipliers. Since w is a direction vector, we do not care about its magnitude. Therefore we solve a constrained problem similar to that in PCA,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br /><br />
subject to <math>\displaystyle w^Ts_ww=1</math> <br />
<br /><br /><br />
We solve the following Lagrange multiplier problem,<br />
<br /><br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br /><br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, our optimization problem for FDA is:<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />subject to: <br />
<math>\displaystyle w^Ts_ww=1</math> <br />
<br />
<br />
Using Lagrange multipliers, we form the Lagrangian: <br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww-1) </math><br />
<br />
Setting the derivative with respect to w to zero gives <math>\displaystyle s_Bw = \lambda s_ww</math>, i.e. <math>\displaystyle s_w^{-1}s_Bw = \lambda w</math> when <math>\displaystyle s_w</math> is invertible. Hence:<br />
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified, such that:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
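<br />
The two-class solution can be checked numerically. The following is a minimal Python sketch (the notes themselves use Matlab) that computes <math>s_w^{-1}(\mu_2 - \mu_1)</math> for a 2-D problem and verifies that it is an eigenvector of <math>s_w^{-1}s_B</math>. The means and covariances are illustrative assumptions, and the helper names <code>inv2</code> and <code>matvec</code> are hypothetical.<br />

```python
# Fisher direction for two classes in 2-D, using plain Python lists.
# The means and covariances below are illustrative assumptions.

def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma1 = [[1.0, 1.5], [1.5, 3.0]]
sigma2 = [[1.0, 1.5], [1.5, 3.0]]

# Within-class covariance s_w = Sigma_1 + Sigma_2
s_w = [[sigma1[i][j] + sigma2[i][j] for j in range(2)] for i in range(2)]
diff = [mu2[0] - mu1[0], mu2[1] - mu1[1]]

# Two-class solution: w is proportional to s_w^{-1} (mu_2 - mu_1)
w = matvec(inv2(s_w), diff)

# Eigenvector check: s_B w = (mu_2 - mu_1) * ((mu_2 - mu_1)^T w),
# so s_w^{-1} s_B w should be a scalar multiple of w.
scale = diff[0] * w[0] + diff[1] * w[1]
s_b_w = [scale * diff[0], scale * diff[1]]
lhs = matvec(inv2(s_w), s_b_w)
eigenvalue = lhs[0] / w[0]

print(w, eigenvalue)
```

Because <math>s_B</math> has rank one in the two-class case, <math>s_w^{-1}s_B</math> has only one non-zero eigenvalue, which is why a single direction suffices.<br />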
<br />
=====Example:=====<br />
<br />
We now see an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data set:<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300) <br />
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300) <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
 %We compute the principal components:<br />
 X=[X1;X2]            % stack the two samples: 600x2<br />
 X=X'                 % 2x600, one column per observation<br />
 [coefs, scores]=princomp(X');<br />
 coefs(:,1) %first principal component<br />
 <br />
 ans =<br />
       0.76355476446932<br />
       0.64574307712603<br />
 <br />
 hold on              % keep earlier plots when adding the directions<br />
 plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
 plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
 sw=2*[1 1.5;1.5 3]   % sw=Sigma1+Sigma2=2*Sigma1<br />
 w=inv(sw)*[4 2]'     % calculate s_w^{-1}(mu2 - mu1)<br />
 plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is no overlap<br />
 Xp=coefs(:,1)'*X<br />
 figure<br />
 plot(Xp(1:300),1,'ob')<br />
 hold on<br />
 plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is an overlap since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we may wish to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (When Y is a continuous variable, the prediction problem is one of regression rather than classification.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set, i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we estimate the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
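<br />
The counting rule above can be sketched in a few lines of Python. The classifier <code>threshold_classifier</code> and the five test points are made up purely for illustration.<br />

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of test points (x_i, y_i) for which h(x_i) != y_i."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

# Hypothetical classifier: predict class 1 when x is positive.
def threshold_classifier(x):
    return 1 if x > 0 else 0

# Made-up test points and labels.
xs = [-2.0, -0.5, 0.3, 1.2, 2.5]
ys = [0, 1, 1, 1, 0]

rate = empirical_error_rate(threshold_classifier, xs, ys)
print(rate)  # two of the five points are misclassified
```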
<br />
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math>:<br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)Pr(Y=1)} {Pr(X=x)} </math><br />
<br />
where the denominator is <math>\, Pr(X=x) = Pr(X=x \mid Y=1)Pr(Y=1)+Pr(X=x \mid Y=0)Pr(Y=0) </math>.<br />
<br />
So our classifier function is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
This function is the best possible classifier in terms of error rate: no other classifier achieves a lower true error rate. The problem is that in practice we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>. The Bayes rule is therefore viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the decision boundary.<br />
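<br />
To make the rule concrete, here is a small Python sketch for a one-dimensional problem in which the distributions are assumed to be known, which is exactly what is unavailable in practice. The class-conditional densities N(1,1) for Y=1 and N(-1,1) for Y=0, and the function names, are assumptions chosen for illustration.<br />

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_class1(x, prior1=0.5):
    """r(x) = Pr(Y=1 | X=x) via Bayes' theorem.

    Assumed class-conditional densities: N(1, 1) for Y=1, N(-1, 1) for Y=0.
    """
    num = normal_pdf(x, 1.0, 1.0) * prior1
    den = num + normal_pdf(x, -1.0, 1.0) * (1.0 - prior1)
    return num / den

def bayes_classifier(x, prior1=0.5):
    """h(x) = 1 if r(x) >= 1/2, otherwise 0."""
    return 1 if posterior_class1(x, prior1) >= 0.5 else 0

print(posterior_class1(0.0), bayes_classifier(2.0), bayes_classifier(-2.0))
```

With equal priors the two posteriors are equal at x = 0, the midpoint of the two assumed means.<br />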
<br />
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, the boundary consists of those points at which the probabilities of belonging to either class are equal. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a linear decision boundary while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
We would like to apply the Bayes classification rule by approximating the class conditional density <math>f_k(x)</math> and the prior <math>\pi_k</math> where,<br /><br /><br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l}f_l(x)\pi_l}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional density is multivariate Gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>(x + a)^TA(x+b) = x^TAx + x^TAb + a^TAx + a^TAb</math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\Sigma_k</math> is the class covariance matrix and <math>\mu_k</math> is the class mean. By definition of the decision boundary,<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is a linear function of <math>x</math> of the form <math>a^Tx + b = 0</math>.<br />
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = \log(f_k(x)\pi_k) = \log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = \log(f_l(x)\pi_l) = \log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log(|\Sigma_l|) </math><br />
<br />
<br />
<br />
To classify a point, x, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class k if <math>\, \delta_k > \delta_l </math> and vice versa. <br /><br /><br />
<br />
:<math><br />
h(x) = \begin{cases}<br />
k, & \text{if } \delta_k > \delta_l \\<br />
l, & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
(note: since <math> - \frac{d}{2}log(2\pi) </math> is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.) <br /><br /><br />
<br />
Consider a '''special case''': <math>\, \Sigma_k = I </math>, the identity matrix. Then,<br />
<br />
<math> \,\delta_k = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) - \frac{1}{2}log(|I|) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) </math><br />
<br />
We see that in the case <math> \Sigma = I </math>, we can simply classify a point, x, to a class based on the distances between x and the mean of the different classes (adjusted with the log of the prior). <br /><br /><br />
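<br />
The special case <math> \Sigma = I </math> is easy to sketch in code. The Python fragment below evaluates <math>\delta_k = \log(\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k)</math> for each class and picks the largest; the class means and priors are hypothetical values chosen for illustration.<br />

```python
import math

def delta_identity(x, mu, prior):
    """delta_k = log(pi_k) - 0.5 * ||x - mu_k||^2, valid when Sigma = I."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.log(prior) - 0.5 * sq_dist

def classify(x, means, priors):
    """Return the index k of the class with the largest delta_k."""
    scores = [delta_identity(x, mu, p) for mu, p in zip(means, priors)]
    return max(range(len(scores)), key=lambda k: scores[k])

means = [[0.0, 0.0], [4.0, 0.0]]  # hypothetical class means

# Equal priors: the point is assigned to the nearest mean.
print(classify([2.1, 0.0], means, [0.5, 0.5]))
# A strong prior for class 0 can override the distance comparison.
print(classify([2.1, 0.0], means, [0.99, 0.01]))
```

The second call shows the role of the log-prior adjustment: the point is slightly closer to the second mean, yet the heavily favoured first class wins.<br />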
<br />
In the case that <math>\, \Sigma_k \ne I </math>, we use the singular value decomposition, <br /><br /><br />
<br />
<math> \, \Sigma_k = USV^T = USU^T </math> (since <math> \Sigma_k </math> is symmetric)<br /><br /><br />
<math> \, \Sigma_k^{-1} = (USU^T)^{-1} = (U^T)^{-1}S^{-1}U^{-1} = US^{-1}U^T </math> (since <math> U </math> is orthogonal)<br /><br /><br />
<br />
So, <br /><br /><br />
<br />
<math></div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3440stat341 / CM 3612009-07-24T02:38:45Z<p>Hclam: /* Linear Discriminant Analysis(LDA) - July 23 */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. Because a computer is deterministic, in practice we generate pseudo-random numbers: deterministic sequences in which each possible number appears to occur with uniform probability. Outside a computational setting, producing draws from a uniform distribution is fairly easy (for example, rolling a fair die repeatedly to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
In an ideal situation, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, Park and Miller found in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
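<br />
The recurrence is short enough to try directly. The Python sketch below reproduces the two worked examples above (''a'' = 13, ''b'' = 0, ''m'' = 31 and ''a'' = 3, ''b'' = 2, ''m'' = 4); the helper names <code>lcg</code> and <code>take</code> are hypothetical, and the scaling by ''m'' - 1 follows the convention used in these notes.<br />

```python
def lcg(a, b, m, seed):
    """Yield x_0, x_1, ... where x_{i+1} = (a * x_i + b) mod m."""
    x = seed
    while True:
        yield x
        x = (a * x + b) % m

def take(gen, n):
    """Collect the first n values produced by a generator."""
    return [next(gen) for _ in range(n)]

# a = 13, b = 0, m = 31, x_0 = 1: visits every value 1..30 before repeating.
good = take(lcg(13, 0, 31, 1), 31)

# a = 3, b = 2, m = 4, x_0 = 1: stuck at 1 forever.
bad = take(lcg(3, 2, 4, 1), 5)

# Scale one full period to (0, 1] by dividing by m - 1.
uniforms = [x / (31 - 1) for x in good[:30]]

print(good[:4], bad)
```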
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then the distribution of the random variable X is given by F(x).<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is at most a larger number can be no smaller than the probability that X is at most a smaller number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also uniform on [0, 1], we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
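<br />
For readers without Matlab, the same inversion can be sketched in Python using only the standard library. The function name <code>sample_exponential</code>, the seed, and the sample size are arbitrary choices made for illustration.<br />

```python
import math
import random

def sample_exponential(theta, n, rng):
    """Draw n samples from f(x) = theta * exp(-theta * x) by inversion."""
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and log() is finite.
    return [-math.log(1.0 - rng.random()) / theta for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = sample_exponential(1.0, 100000, rng)
sample_mean = sum(samples) / len(samples)

print(sample_mean)  # should be close to 1 / theta
```

A quick sanity check is that the sample mean should be close to <math>1/\theta</math>, the mean of the exponential distribution.<br />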
<br />
<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math> in closed form. Further, for some distributions it is not even possible to find a closed-form expression for <math>F(x)</math> itself.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (the intervals below are half-open; since U falls exactly on an endpoint with probability zero, the choice between strict and non-strict inequalities does not affect the distribution of X):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U < p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 \leq U < p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} \leq U < p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
for i=1:1000;<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
   elseif u < sum(p(1:2))<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
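<br />
The discrete procedure generalizes to any finite list of values by walking through the cumulative probabilities. A minimal Python sketch follows; the function name <code>sample_discrete</code> and the seed are hypothetical choices.<br />

```python
import random

def sample_discrete(values, probs, rng):
    """Return values[k] for the first k whose cumulative probability exceeds U."""
    u = rng.random()
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if u < cumulative:
            return value
    # Guard against rounding when the probabilities sum to slightly less than 1.
    return values[-1]

rng = random.Random(1)  # fixed seed so the run is repeatable
draws = [sample_discrete([0, 1, 2], [0.3, 0.2, 0.5], rng) for _ in range(100000)]
freq = [draws.count(v) / len(draws) for v in (0, 1, 2)]

print(freq)  # should be close to [0.3, 0.2, 0.5]
```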
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant c such that,<br /><br /><br />
<math> c \cdot g(x) \geq f(x)\ \forall x</math><br />
<br /><br /><br />
drawing candidates in succession from <math>g(x)</math> and accepting each candidate x with probability<br />
<br /><br /><br />
<math> \frac {f(x)}{c \cdot g(x)} </math><br />
<br /><br /><br />
will yield a sample that follows the target distribution <math>f(x)</math>; candidates are more likely to be accepted where this ratio is close to 1 and more likely to be rejected where it is close to 0.<br />
<br />
<br />
<br />
At x=7, sampling from <math> c \cdot g(x)</math> will yield a sample that follows the target distribution <math>f(x)</math>.<br />
<br />
At x=9, we will reject samples according to the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> after sampling from <math> c \cdot g(x)</math>.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x) </math><br />
<br />
So,<br />
<math>\begin{align} Pr(accept) &= \int_x Pr(accept|X) \times Pr(X) dx \\<br />
&= \int_x \frac{f(x)}{c \cdot g(x)} g(x) dx \\<br />
&= \frac{1}{c} \int_x f(x) dx \\<br />
&= \frac{1}{c} \end{align}</math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
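 n = 0;                     % number of accepted samples so far<br />
 while n < 1000<br />
     y = rand;              % candidate from the proposal Unif[0,1]<br />
     u = rand;<br />
     if u <= y              % accept with probability f(y)/(c*g(y)) = y<br />
         n = n + 1;<br />
         x(n) = y;<br />
     end<br />
 end<br />
 <br />
 hist(x);<br />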
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
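<br />
The procedure can also be written once for any target and proposal. The following Python sketch is one way to do so under the stated condition <math> c \cdot g(x) \geq f(x) </math>; the function <code>accept_reject</code> and its parameter names are hypothetical, and it is applied here to the Beta(2,1) example with a Unif[0,1] proposal.<br />

```python
import random

def accept_reject(f, g_pdf, g_sample, c, n, rng):
    """Draw n samples from density f using a proposal g with c * g(x) >= f(x)."""
    out = []
    while len(out) < n:
        y = g_sample(rng)                 # candidate from the proposal
        u = rng.random()                  # uniform draw for the accept test
        if u <= f(y) / (c * g_pdf(y)):
            out.append(y)                 # accept; otherwise draw again
    return out

rng = random.Random(2)  # fixed seed so the run is repeatable
beta21 = accept_reject(
    f=lambda x: 2.0 * x,            # Beta(2,1) density on [0, 1]
    g_pdf=lambda x: 1.0,            # Unif[0, 1] density
    g_sample=lambda r: r.random(),  # draw from Unif[0, 1]
    c=2.0, n=50000, rng=rng)
beta_mean = sum(beta21) / len(beta21)

print(beta_mean)  # the mean of Beta(2,1) is 2/3
```

On average a fraction 1/c of the candidates is accepted, so a smaller valid c means fewer wasted draws.<br />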
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
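<br />
As a rough check of the discrete example, the steps can be sketched in Python. The names <code>pmf</code>, <code>g</code> and <code>sample_pmf</code> are hypothetical, and the seed and sample size are arbitrary.<br />

```python
import random

pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}  # target distribution f
g = 0.25                                     # discrete uniform proposal on {1, 2, 3, 4}
c = max(pmf.values()) / g                    # 0.55 / 0.25 = 2.2

def sample_pmf(rng):
    """Draw one value from pmf by acceptance/rejection."""
    while True:
        y = rng.choice([1, 2, 3, 4])   # candidate from the proposal
        u = rng.random()
        if u <= pmf[y] / (c * g):      # accept with probability f(y) / (c * g(y))
            return y

rng = random.Random(3)  # fixed seed so the run is repeatable
draws = [sample_pmf(rng) for _ in range(40000)]
freq = {k: draws.count(k) / len(draws) for k in pmf}

print(c, freq)
```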
<br />
'''Limitations'''<br />
<br />
If the proposed distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find such <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult to use either the Inverse Method (because of the difficulty in computing the inverse function) or the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math>, i.e. mean <math> 1/\lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
It appears that if we want to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add the results. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
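<br />
A Python counterpart of the procedure, for integer <math>t</math>, is sketched below. It multiplies the uniforms first and takes a single logarithm, as in the last line of the derivation; the function name <code>sample_gamma</code>, the seed, and the sample size are hypothetical choices.<br />

```python
import math
import random

def sample_gamma(t, lam, rng):
    """Gamma(t, lam) for integer t, as a sum of t independent Exp(lam) draws."""
    # 1 - rng.random() lies in (0, 1], which keeps log() finite.
    product = 1.0
    for _ in range(t):
        product *= 1.0 - rng.random()
    return -math.log(product) / lam

rng = random.Random(4)  # fixed seed so the run is repeatable
gamma_draws = [sample_gamma(5, 1.0, rng) for _ in range(20000)]
gamma_mean = sum(gamma_draws) / len(gamma_draws)

print(gamma_mean)  # the mean of Gamma(t, lam) is t / lam = 5
```

For very large <math>t</math> the product of uniforms can underflow to zero, in which case summing the individual logarithms is the safer form.<br />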
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
The normal distribution is difficult to sample from using the general methods seen so far: for example, <math>F(Z)</math> has no closed form, so its inverse is not easy to obtain. We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is its distance from the pole (the origin), and θ is the angle it forms with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
Changing variables to <math>d = r^2</math> and θ, it can be shown that the joint distribution of d and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, an Exponential with rate 1/2 and a Uniform on <math>[0,2\pi]</math>, so d and <math>\theta</math> are independent<br />
<math> \begin{matrix} \Rightarrow d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
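<br />
The factorization into an exponential and a uniform density can be obtained with a change of variables. The following derivation is a sketch that writes the Cartesian coordinates in terms of <math>d = r^2</math> and <math>\theta</math> and applies the Jacobian formula to the joint density <math>f(x,y)</math> given earlier:<br />

```latex
\begin{align}
x &= \sqrt{d}\cos\theta, \qquad y = \sqrt{d}\sin\theta, \qquad d \ge 0,\ \theta \in [0, 2\pi) \\
\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right|
  &= \left|\det\begin{pmatrix}
       \frac{\cos\theta}{2\sqrt{d}} & -\sqrt{d}\sin\theta \\
       \frac{\sin\theta}{2\sqrt{d}} & \sqrt{d}\cos\theta
     \end{pmatrix}\right|
   = \frac{\cos^2\theta + \sin^2\theta}{2} = \frac{1}{2} \\
f(d,\theta) &= f(x,y)\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right|
   = \frac{1}{2\pi}e^{-\frac{d}{2}}\cdot\frac{1}{2}
   = \underbrace{\frac{1}{2}e^{-\frac{d}{2}}}_{\text{Exp}(1/2)\text{ density}}
     \cdot \underbrace{\frac{1}{2\pi}}_{\text{Unif}[0,2\pi]\text{ density}}
\end{align}
```

Here <math>Exp(1/2)</math> denotes the exponential distribution with rate 1/2 (mean 2), which is why the inverse transform for d is <math>-2\log(u)</math>.<br />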
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
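<br />
Histograms only give a visual impression. As a complementary numerical check, the sketch below repeats the Box-Muller steps and prints the sample means, variances and correlation of x and y, which should be near 0, 1 and 0 respectively. The sample size <code>n</code> is an assumed value.<br />

```matlab
n = 100000;                     % number of pairs; assumed value
u1 = rand(n,1);
u2 = rand(n,1);

r = sqrt(-2*log(u1));           % radius: r^2 = d ~ Exp(1/2)
theta = 2*pi*u2;                % angle: Unif[0, 2*pi]
x = r.*cos(theta);
y = r.*sin(theta);

% Summary statistics: means near 0, variances near 1, correlation near 0
C = corrcoef([x y]);            % 2-by-2 correlation matrix of the two columns
stats = [mean(x) mean(y); var(x) var(y)];
disp(stats);
fprintf('corr(x,y) = %.4f\n', C(1,2));
```

A correlation near zero is consistent with independence but does not prove it; independence follows from the factorization of the joint density.<br />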
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \overset{iid}{\sim} N(0,1) \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: Sample from a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (called u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we generated a square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
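<br />
To see that both square roots give the same distribution, one can compare the sample mean and covariance of the two constructions with <math>\mu</math> and <math>\Sigma</math>. The sketch below does this; the sample size <code>n</code> and the names <code>R</code>, <code>S</code>, <code>X1</code>, <code>X2</code> are assumptions made for illustration.<br />

```matlab
mu = [-2; 3];
Sigma = [1 0.5; 0.5 1];
n = 200000;                          % sample size; assumed value
Z = randn(n, 2);                     % rows are independent N(0, I) vectors

R = chol(Sigma);                     % upper triangular, R'*R = Sigma
S = sqrtm(Sigma);                    % symmetric, S*S = Sigma

X1 = ones(n,1)*mu' + Z*R;            % row covariance R'*R = Sigma
X2 = ones(n,1)*mu' + Z*S;            % row covariance S'*S = Sigma

disp(cov(X1));                       % both should be close to Sigma
disp(cov(X2));
disp([mean(X1); mean(X2)]);          % both rows should be close to mu'
```

Because the samples are stored as rows, the factor multiplies on the right; with column vectors the equivalent form would be <code>mu + R'*z</code>.<br />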
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[-\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2\right] </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Take <math>\theta = \mu</math> with <math>\sigma</math> known, and assume the prior on <math>\mu</math> is Gaussian:<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
:<math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean differs from the MLE: it is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
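<br />
The limiting behaviour of <math>\tilde{\mu}</math> can be seen numerically. The sketch below simulates data for several sample sizes and prints the MLE next to the posterior mean. The true mean, <math>\sigma</math>, <math>\mu_0</math> and <math>\tau</math> are assumed values chosen only for illustration.<br />

```matlab
mu_true = 2;  sigma = 1;        % data model N(mu_true, sigma^2), sigma known; assumed values
mu0 = 0;      tau = 0.5;        % prior N(mu0, tau^2); assumed values

for N = [1 10 100 10000]
    x = mu_true + sigma*randn(N,1);             % simulated data
    xbar = mean(x);                             % frequentist estimate (MLE)
    wgt = (N/sigma^2) / (N/sigma^2 + 1/tau^2);  % weight on the sample mean
    mu_post = wgt*xbar + (1 - wgt)*mu0;         % Bayesian posterior mean
    fprintf('N = %5d  MLE = %.4f  posterior mean = %.4f\n', N, xbar, mu_post);
end
```

For small N the posterior mean is pulled toward <math>\mu_0</math>; as N grows the weight on <math>\bar{x}</math> approaches 1 and the two estimates agree.<br />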
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is hard to find a precise definition of the "Monte Carlo Method", the method is widely used and is described in numerous articles and monographs. The term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to simulating a random process, that is, a process whose future evolution is described by probability distributions (the counterpart of a deterministic process); such simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
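<br />
The process and the two statistics above can be put together in a few lines. The following is a minimal sketch, assuming the integrand <math>h(x) = e^x</math> on <math>[0,2]</math> (chosen because the exact value <math>e^2 - 1</math> is known) and a 95% interval with <math>Z_{\alpha/2} = 1.96</math>; none of these choices come from the lecture.<br />

```matlab
h = @(x) exp(x);              % integrand; assumed example
a = 0;  b = 2;                % integration limits; assumed values
N = 100000;                   % number of samples; assumed value

x = a + (b - a)*rand(N,1);    % x_i ~ Unif(a,b)
Y = h(x)*(b - a);             % Y_i = w(x_i) = h(x_i)(b-a)

I_hat = mean(Y);              % Monte Carlo estimate
SE = std(Y)/sqrt(N);          % standard error; std uses the N-1 denominator
z = 1.96;                     % Z_{alpha/2} for alpha = 0.05
CI = [I_hat - z*SE, I_hat + z*SE];

fprintf('estimate %.4f, SE %.4f, 95%% CI (%.4f, %.4f), exact %.4f\n', ...
        I_hat, SE, CI(1), CI(2), exp(2) - 1);
```

The standard error shrinks like <math>1/\sqrt{N}</math>, so halving the width of the interval takes roughly four times as many samples.<br />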
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}, \quad x \geq 0\end{matrix} </math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, i.e. <math>x_i \sim~ Exp(1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
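u=rand(100,1);             % uniform samples<br />
x=-log(u);                 % inverse transform: x ~ Exp(1)<br />
h= x.^ .5;                 % h(x) = sqrt(x), elementwise<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value of the integral from numerical quadrature (exact value: sqrt(pi)/2, about 0.8862)<br />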
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# Exact value, for comparison: <math> \displaystyle I = \left.\frac{x^4}{4}\right|_0^1 = \frac{1}{4} </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);          % 1000 samples from Unif(0,1)<br />
mean(x.^3)                 % elementwise cube; should be close to 1/4<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite interval<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute in <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math>, and <math>x = 5, \infty</math> map to <math>y = 1, 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
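<br />
Example 3 can be checked numerically, since the integral has the closed form <math>\int_5^\infty 3e^{-x}\,dx = 3e^{-5}</math>. The sketch below uses the substitution <math>x = \frac{1}{y}+4</math>, under which the integrand on <math>(0,1)</math> is positive; the sample size <code>N</code> is an assumed value.<br />

```matlab
N = 100000;                              % number of samples; assumed value
y = rand(N,1);                           % y_i ~ Unif(0,1)
w = 3*exp(-(1./y + 4)) ./ y.^2;          % integrand after substituting x = 1/y + 4
I_hat = mean(w);
I_exact = 3*exp(-5);                     % closed form of int_5^inf 3*exp(-x) dx
fprintf('estimate %.5f, exact %.5f\n', I_hat, I_exact);
```

The transformed integrand is bounded on <math>(0,1)</math> because <math>e^{-1/y}</math> goes to zero faster than <math>y^2</math> as <math>y \rightarrow 0</math>, so the estimator has finite variance.<br />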
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution and the posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. Under the flat prior, the posterior of <math>\displaystyle p_1</math> is proportional to the Binomial likelihood <math>\displaystyle p_1^x(1-p_1)^{n-x}</math> viewed as a function of <math>\displaystyle p_1</math>, which is the kernel of a Beta distribution (and similarly for <math>\displaystyle p_2</math>). So<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\% \text{ CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
Using a Beta posterior for each <math>\displaystyle p_i</math> as in Example 1, the procedure is:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math><br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta = x_j</math> by finding the first index j for which the sampled <math>\displaystyle p_j \geq 0.5</math>.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, suppose that 10 rats are used at each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, and that the observed number of deaths is <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
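<br />
The Monte Carlo process for the dose-response example can be sketched as follows. The dose levels <code>x</code>, the death counts <code>y</code> and the number of candidate draws <code>M</code> are hypothetical values made up for illustration (the lecture's full data set is not given here). The sketch draws all candidates at once and filters them, which is equivalent to discarding and retrying, and it also drops candidates in which no <math>p_i</math> reaches 0.5, an assumption needed so that <math>\delta</math> is defined. As in the code for Example 1, <code>betarnd</code> requires the Statistics Toolbox.<br />

```matlab
n = 10;                                  % rats per dose
x = 1:10;                                % dose levels; hypothetical values
y = [0 1 2 3 5 6 7 8 9 10];              % deaths per dose; hypothetical data
M = 200000;                              % number of candidate draws; assumed value

% Step 1: draw each p_i from Beta(y_i+1, n-y_i+1), one row per candidate
P = betarnd(repmat(y+1, M, 1), repmat(n-y+1, M, 1));

% Step 2: keep only candidates with p_1 <= p_2 <= ... <= p_10
ordered = all(diff(P, 1, 2) >= 0, 2);
% For an ordered row, some p_i >= 0.5 exactly when the last entry is >= 0.5
P = P(ordered & P(:,end) >= 0.5, :);

% Step 3: delta = x_j, with j the first index where p_j >= 0.5
[~, j] = max(double(P >= 0.5), [], 2);   % max returns the first maximal entry
delta = x(j);

% Steps 4-5: average the accepted values
fprintf('accepted %d of %d draws, delta_bar = %.3f\n', size(P,1), M, mean(delta));
```

Only a small fraction of candidates satisfies the ordering constraint, so the number of accepted draws is much smaller than <code>M</code>; this is the cost of the simple rejection step.<br />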
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the random variables. The main idea in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others, so these values should be sampled more often. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation of the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the proposal distribution to sample from and g(x) is the function to be integrated. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from (with <math>g(x) > 0</math> wherever <math>h(x)f(x) \neq 0</math>), then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x) and therefore <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
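<br />
The three steps of the process can be sketched as below. As an illustration only, the target <math>f</math> is taken to be the <math>N(0,1)</math> density, the proposal <math>g</math> the <math>N(0,s^2)</math> density with <math>s = 2</math>, and <math>h(x) = x^2</math>, so that the exact answer is <math>E_f[X^2] = 1</math>. These choices are assumptions and are not part of the lecture.<br />

```matlab
N = 100000;                                   % number of samples; assumed value
s = 2;                                        % proposal standard deviation; assumed value

f = @(x) exp(-x.^2/2)/sqrt(2*pi);             % target density N(0,1)
g = @(x) exp(-x.^2/(2*s^2))/(s*sqrt(2*pi));   % proposal density N(0,s^2)
h = @(x) x.^2;                                % E_f[h(X)] = 1

x = s*randn(N,1);                             % step 1: x_i ~ g
w = h(x).*f(x)./g(x);                         % step 2: w(x_i) = h(x_i) f(x_i)/g(x_i)
I_hat = mean(w);                              % step 3: average of the w(x_i)
fprintf('estimate %.4f (exact value 1)\n', I_hat);
```

The proposal here has heavier tails than the target, which keeps the ratio <math>f/g</math> bounded.<br />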
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> at which the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), has high density relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method with samples drawn from <math>\displaystyle g</math>, taking each value <math>\displaystyle h(x_i)</math> from this process and weighting more heavily the ones that are more likely under <math>\displaystyle f</math> than under <math>\displaystyle g</math> (i.e. the ones for which <math>\displaystyle \frac{f(x)}{g(x)}</math> is high).<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of <math> h(x_i) </math> instead of a plain sum.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f > g</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen <math>\displaystyle g</math>, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> faster than <math>\displaystyle h^2(x)f^2(x) </math>, then the integrand blows up and <math>\displaystyle E[(w(x))^2] </math> can be infinite. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>: then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can be arbitrarily large in the tails. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
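<br />
A worked example makes the tail condition concrete. The following sketch assumes <math>h(x) = 1</math>, a standard normal target <math>f</math>, and a proposal <math>g</math> that is the <math>N(0,\sigma^2)</math> density; these choices are illustrative and not from the lecture. The second moment is then a Gaussian integral that is finite only when the proposal is wide enough:<br />

```latex
\begin{align}
E_g\left[(w(x))^2\right] = \int \frac{f^2(x)}{g(x)}\,dx
 &= \int \frac{\frac{1}{2\pi}e^{-x^2}}{\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{x^2}{2\sigma^2}}}\,dx
  = \frac{\sigma}{\sqrt{2\pi}} \int e^{-x^2\left(1-\frac{1}{2\sigma^2}\right)}\,dx \\
 &= \begin{cases}
      \dfrac{\sigma^2}{\sqrt{2\sigma^2-1}}, & \sigma^2 > \tfrac{1}{2} \\
      \infty, & \sigma^2 \le \tfrac{1}{2}
    \end{cases}
\end{align}
```

For <math>\sigma = 1</math> the proposal equals the target and the second moment is 1, as expected since every weight is 1. As <math>\sigma^2</math> decreases toward 1/2 the second moment grows without bound, even though the estimator remains unbiased.<br />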
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and can't be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that minimizes the variance of the estimator.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
'''Aside: Jensen's Inequality'''<br />
<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math> for all <math>\displaystyle \alpha \in [0,1]</math>. Here <math>\displaystyle g </math> denotes a generic convex function, not the proposal density.<br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0 </math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
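Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math>, then <math>\displaystyle w(x) = \frac{h(x)f(x)}{g^*(x)} = \pm\int \left| h(s) \right| f(s)\,ds</math>, so <math>\displaystyle (w(x))^2</math> is constant and <math>\displaystyle E[(w(x))^2] = \left(\int \left| h(x) \right| f(x)\,dx\right)^2</math>. The lower bound is therefore attained by <math>\displaystyle g^*</math>, as required.<br /><br />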
<br />
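However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s)\,ds</math> is an integral of the same kind as the one we are trying to estimate.<br />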
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
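<br />
In a case where <math>g^*</math> happens to be available, the theorem can be seen directly. The sketch below assumes <math>f(x) = e^{-x}</math> on <math>x \geq 0</math> and <math>h(x) = x</math>, so that <math>I = 1</math> and <math>g^*(x) = xe^{-x}</math>, the <math>Gamma(2,1)</math> density. This example is an illustration chosen here, not one from the lecture; the Gamma draws use the additive property of exponentials.<br />

```matlab
N = 1000;                            % number of samples; assumed value
f = @(x) exp(-x);                    % target density Exp(1) on x >= 0
h = @(x) x;                          % I = E_f[X] = 1
gstar = @(x) x.*exp(-x);             % g* = |h| f / int |h| f, the Gamma(2,1) density

% Gamma(2,1) draw as a sum of two Exp(1) draws (additive property)
x = -log(rand(N,1)) - log(rand(N,1));

w = h(x).*f(x)./gstar(x);            % equals 1 for every sample
fprintf('estimate %.4f, sample variance of w %.2e\n', mean(w), var(w));
```

Every weight equals the true value of the integral, so the estimator has zero variance. Writing down <code>gstar</code> required knowing that <math>\int xe^{-x}\,dx = 1</math>, which is the quantity being estimated; this is why the result is mainly of theoretical interest.<br />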
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \ h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since <math>\displaystyle \int f(x)\,dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int\ h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int\ h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)}{\frac{1}{N}\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
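<br />
The normalized form can be tried on a density known only up to a constant. In the sketch below, <code>f_un</code> is the <math>N(0,1)</math> density with its factor <math>1/\sqrt{2\pi}</math> deliberately omitted, the proposal is <math>N(0,s^2)</math> with <math>s = 2</math>, and <math>h(x) = x^2</math>, so the exact answer is 1. All of these choices are assumptions made for illustration.<br />

```matlab
N = 100000;                                   % number of samples; assumed value
s = 2;                                        % proposal standard deviation; assumed value

f_un = @(x) exp(-x.^2/2);                     % N(0,1) density without its 1/sqrt(2*pi) constant
g = @(x) exp(-x.^2/(2*s^2))/(s*sqrt(2*pi));   % proposal density N(0,s^2)
h = @(x) x.^2;                                % E_f[h(X)] = 1

x = s*randn(N,1);                             % x_i ~ g
b = f_un(x)./g(x);                            % unnormalized weights b(x_i)
bn = b/sum(b);                                % normalized weights b'(x_i), summing to 1
I_hat = sum(bn.*h(x));
fprintf('self-normalized estimate %.4f (exact value 1)\n', I_hat);
```

Any constant factor in <code>f_un</code> appears in both the numerator and the denominator of <math>b'</math> and cancels, which is why the missing normalizing constant does not matter.<br />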
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to compute the proportion of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is a low probability event (a tail event), and different runs may give significantly different values. For example, if we generate 1000 samples, the first run may give only 3 occurrences (so the estimated probability is .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we generate 100 samples in each of 100 runs):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x)</math> be the <math>N(4,1)</math> density<br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>. The variance is very close to 0 and is much smaller than the variance observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(X_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set from which t takes its values. It could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math>, their joint density factors as<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the whole past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
For a Markov Chain the factorization above therefore simplifies to<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from state <math>i</math> to state <math>j</math> in one step.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below (in each interior row, <math>q</math> sits to the left of the diagonal and <math>p</math> to the right):<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0&0\\<br />
q&0&p&0&\dots&0&0\\<br />
0&q&0&p&\dots&0&0\\<br />
\vdots&\vdots&\ddots&\ddots&\ddots&\vdots&\vdots\\<br />
0&0&\dots&0&q&0&p\\<br />
0&0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
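<br />
A minimal Matlab sketch of this matrix may help. It follows the convention in the text that <math>p</math> is the probability of a step to the right; the number of states <code>K</code> and the value of <code>p</code> are illustrative assumptions, not values from the lecture.<br />

```matlab
% Sketch: random walk on states 1..K with absorbing first and last states.
% K and p are illustrative values (assumptions).
K = 5;            % number of states
p = 0.3;          % probability of a step to the right
q = 1 - p;        % probability of a step to the left
P = zeros(K);
P(1,1) = 1;       % first state: we stop here
P(K,K) = 1;       % last state: we stop here
for i = 2:K-1
    P(i,i-1) = q; % move left
    P(i,i+1) = p; % move right
end
P
rowSums = sum(P,2)   % each row is a pmf, so every entry should be 1
```

Each row of <code>P</code> is the conditional pmf of the next state given the current one, which is why the row sums are all 1.<br />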
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a set of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a variation of importance estimation can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> (summing over all <math>j</math>) add up to 1: the row must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
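<br />
The theorem can be checked numerically. The sketch below uses a made-up two-state chain (the entries of <code>P</code>, the initial vector <code>mu0</code> and the horizon <code>n</code> are assumptions chosen only for illustration) and compares stepping the marginal forward one step at a time with the closed form <math>\mu_0P^n</math>.<br />

```matlab
% Illustrative two-state chain; the numbers are assumptions.
P   = [0.9 0.1; 0.5 0.5];   % transition matrix, each row sums to 1
mu0 = [1 0];                % start in state 1 with probability 1
n   = 3;
mu  = mu0;
for k = 1:n
    mu = mu*P;              % one step: mu_k = mu_{k-1} * P
end
muDirect = mu0*P^n;         % theorem: mu_n = mu_0 * P^n
disp([mu; muDirect])        % the two rows should agree
```

Note that the marginal is a row vector multiplying <code>P</code> from the left, matching the convention <math>\mu_n=\mu_{n-1}P</math> used above.<br />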
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only stay where they are or move to each other (rows 1 and 2 are zero outside the first two columns)<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> The state 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
::<br />
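The decomposition can also be found mechanically from the transition matrix. The following is only a sketch of one possible approach (repeated boolean matrix products to obtain reachability); the variable names are hypothetical and the method is not taken from the lecture.<br />

```matlab
% Communicating classes of the 4-state example via reachability.
P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];
N = size(P,1);
A = (P > 0) | logical(eye(N));    % one-step reachability; each state reaches itself
R = A;
for k = 1:N                       % N rounds are enough to propagate reachability
    R = (double(R)*double(A)) > 0;
end
C = R & R';                       % C(i,j) is true when i and j communicate
classId = zeros(1,N);
nClasses = 0;
for i = 1:N
    if classId(i) == 0            % state i has not been assigned to a class yet
        nClasses = nClasses + 1;
        classId(C(i,:)) = nClasses;
    end
end
classId                           % states sharing a label are in the same class
```

States 1 and 2 receive the same label, while states 3 and 4 each get a label of their own, in agreement with the three classes listed above.<br />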
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(X_{n}=i \text{ for some } n\geq 1 \mid X_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. If we take n steps to the right, the only way to be back at <math>\displaystyle 0</math> is to also take n steps in the opposite direction, so a return is only possible after an even number of steps.<br />
<math>\displaystyle Pr(X_{2n}=0\mid X_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(2n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle 4pq \leq 1</math>, with equality only when <math>p = q = \frac{1}{2}</math>, we can conclude that if <math>\displaystyle 4pq < 1</math> the series converges and the state is transient; otherwise the series diverges and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n) \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
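<br />
The difference between the two cases can be seen numerically. The sketch below accumulates partial sums of <math>P_{00}(2n)={2n \choose n}p^nq^n</math> using the ratio of consecutive terms, which avoids computing large binomial coefficients. The cut-off <code>nMax</code> and the value <math>p=0.6</math> are arbitrary choices made for illustration.<br />

```matlab
% Partial sums of P_00(2n) = nchoosek(2n,n) * p^n * q^n for two values of p.
% nMax and the choice p = 0.6 are illustrative assumptions.
nMax  = 5000;
pVals = [0.5 0.6];
S = zeros(size(pVals));
for k = 1:numel(pVals)
    p = pVals(k);
    q = 1 - p;
    term = 1;                           % the n = 0 term
    S(k) = term;
    for n = 1:nMax
        term = term*2*(2*n-1)/n*p*q;    % nchoosek(2n,n)/nchoosek(2n-2,n-1) = 2(2n-1)/n
        S(k) = S(k) + term;
    end
end
S                                       % partial sums for p = 0.5 and p = 0.6
closedForm = 1/sqrt(1 - 4*0.6*0.4)      % limit predicted by the binomial series for p = 0.6
```

For <math>p=0.6</math> the partial sum settles at the value given by the closed form, while for <math>p=0.5</math> it keeps growing as <code>nMax</code> is increased, which is the numerical signature of a divergent series.<br />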
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time needed to go from state i to state j. If state j is not reachable from state i, the recurrence time is infinite.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
min\{n \geq 1: X_{n}=j \mid X_{0}=i\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(X_{1}\neq j,X_{2}\neq j,\cdots,X_{n-1}\neq j,X_{n}=j \mid X_{0}=i)</math> is the probability that the first visit to j, starting from i, happens at step n.<br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
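A state <math>i</math> of a Markov chain has period <math>\displaystyle d</math> if a return to <math>i</math> is only possible in a number of steps that is a multiple of <math>d</math>, that is <math>d=\gcd\{n\geq 1: P_{ii}(n)>0\}</math>. The chain is called <math>\emph{periodic}</math> if <math>d>1</math>. In the simplest case, starting from a state, we return to it every <math>\displaystyle d</math> steps with probability <math>\displaystyle 1</math>.<br />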
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition. Starting from state 1, a return to state 1 is only possible after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
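<br />
The period of a state can be estimated numerically as the greatest common divisor of the step counts <math>n</math> for which a return has positive probability. The sketch below does this for state 1 of the four-state chain above; the search horizon <code>nMax</code> is an arbitrary assumption, so the result is only a check over that finite horizon.<br />

```matlab
% Period of state 1: gcd of all n <= nMax with P^n(1,1) > 0.
P = [0 0.5 0 0.5; 0.5 0 0.5 0; 0 0.5 0 0.5; 0.5 0 0.5 0];
nMax = 20;                 % search horizon (an arbitrary choice)
d = 0;                     % gcd(0,n) = n, so 0 means "no return seen yet"
Pn = eye(size(P,1));
for n = 1:nMax
    Pn = Pn*P;             % Pn now holds P^n
    if Pn(1,1) > 0         % a return to state 1 in n steps is possible
        d = gcd(d, n);
    end
end
d                          % period of state 1 over the searched horizon
```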
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math> for all <math>i, j</math>.<br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.<br />
'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math>, as required, where the last step uses the fact that each row of <math>P</math> sums to 1.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \ 0 \ 0.5]</math>, then<br />
in the long run the chain will spend half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\frac{1}{N}\displaystyle\sum_{n=1}^Ng(X_n)\longrightarrow E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math> as <math>N \rightarrow \infty</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
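<br />
The hand calculation can be cross-checked in Matlab. This sketch stacks the equations <math>\pi(P-I)=0</math> and <math>\sum_i \pi_i=1</math> into one linear system and solves it with the backslash operator; it then compares the answer with the rows of a high power of <math>P</math>. The power 50 is an arbitrary choice, and the variable names are hypothetical.<br />

```matlab
% Stationary distribution of the 3-state example as a linear system.
P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];
N = size(P,1);
A = [P' - eye(N); ones(1,N)];   % (P' - I)*pi' = 0 stacked with sum(pi) = 1
b = [zeros(N,1); 1];
piHat = (A\b)'                  % least-squares solve of the consistent system
Plimit = P^50                   % every row should be close to piHat
```

Both results agree with <math>\pi = (4/11, 4/11, 3/11)</math>, which illustrates that for this chain the stationary distribution is also the limiting distribution.<br />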
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>: this is the starting point of the chain. Choose it randomly and set the index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y \sim q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. Draw <math> U \sim Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. The starting point of the algorithm, <math>X_0</math>, is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of the Metropolis Hastings Algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the probability of setting the mean to x and sampling y is equal to the probability of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> typically lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution within the number of iterations we run.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: the shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
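<br />
One simple way to quantify the effect of b is to record the fraction of proposals that are accepted. The acceptance rate is a diagnostic added here for illustration and was not part of the lecture code; the sketch reruns the Cauchy example for the three values of b discussed above, and the variable names are hypothetical.<br />

```matlab
% Acceptance rate of the Cauchy example for three proposal widths.
bVals = [0.001 2 1000];            % the three values of b discussed above
nIter = 10000;
accRate = zeros(size(bVals));
for k = 1:numel(bVals)
    b = bVals(k);
    x = randn;                     % starting point of the chain
    nAcc = 0;
    for i = 2:nIter
        y = b*randn + x;           % proposal drawn from N(x, b^2)
        r = min((1 + x^2)/(1 + y^2), 1);
        if rand < r
            x = y;                 % accept the proposal
            nAcc = nAcc + 1;
        end
    end
    accRate(k) = nAcc/(nIter - 1);
end
accRate                            % fraction of accepted proposals for each b
```

A very small b accepts almost every proposal but each step barely moves the chain, while a very large b rejects almost everything. A high acceptance rate on its own is therefore not evidence of good mixing; the histogram of the chain should still be compared with the target.<br />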
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We have only talked about <math>\emph{discrete}</math> Markov chains so far. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance condition as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
That is to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> by detailed balance<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
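<br />
The same argument can be checked numerically in the discrete setting. The sketch below builds the Metropolis-Hastings transition matrix for a made-up three-state target <code>f</code> and a made-up, non-symmetric proposal matrix <code>Q</code> (both are assumptions chosen only for illustration), then tests detailed balance and stationarity.<br />

```matlab
% Discrete Metropolis-Hastings: build P and check detailed balance.
% The target f and the proposal Q are illustrative values (assumptions).
f = [0.2 0.5 0.3];                               % target pmf on 3 states
Q = [0.2 0.5 0.3; 0.6 0.1 0.3; 0.3 0.3 0.4];     % Q(i,j) = q(j|i), rows sum to 1
N = numel(f);
P = zeros(N);
for i = 1:N
    for j = 1:N
        if i ~= j
            r = min(f(j)*Q(j,i)/(f(i)*Q(i,j)), 1);   % acceptance probability r(i,j)
            P(i,j) = Q(i,j)*r;                       % propose j and accept it
        end
    end
    P(i,i) = 1 - sum(P(i,:));     % stay at i: i was proposed, or the proposal was rejected
end
F = diag(f)*P;                          % F(i,j) = f(i)*P(i,j)
balanceError = max(max(abs(F - F')))    % zero up to rounding if detailed balance holds
stationaryError = max(abs(f*P - f))     % zero up to rounding if f is stationary
```

Both errors are at the level of floating-point rounding, which mirrors the two steps of the proof: detailed balance holds for the constructed chain, and detailed balance implies that <code>f</code> is stationary.<br />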
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to help sample from probability distributions that are difficult to sample from. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (that is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although less applicable at times.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough: we also want the chain to converge to the stationary distribution.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed).<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
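:<br\>3. Accept <math>\displaystyle Y </math> with probability <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> (the proposal ratio cancels because <math>\displaystyle q </math> is symmetric)<br />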
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
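<br />
The random walk procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the course's MATLAB code: it assumes a normal increment <math>\displaystyle \epsilon \sim~ N(0,b^2) </math> and a standard Cauchy target, and the names <code>cauchy_density</code> and <code>random_walk_mh</code> are hypothetical.<br />

```python
import math
import random


def cauchy_density(x):
    """Density of the standard Cauchy distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))


def random_walk_mh(f, x0, b, n_steps, rng):
    """Random walk Metropolis-Hastings with N(0, b^2) increments.

    f may be known only up to a constant, since only ratios f(y)/f(x) are used.
    Returns the chain and the fraction of accepted proposals.
    """
    x = x0
    chain = [x]
    accepted = 0
    for _ in range(n_steps):
        y = x + b * rng.gauss(0.0, 1.0)   # Y = X_i + epsilon
        r = min(1.0, f(y) / f(x))         # q is symmetric, so it cancels
        if rng.random() < r:              # accept with probability r
            x = y
            accepted += 1
        chain.append(x)
    return chain, accepted / n_steps


rng = random.Random(0)
chain, rate = random_walk_mh(cauchy_density, 0.0, 2.0, 5000, rng)
print("acceptance rate:", rate)
```

Running the sketch with several values of <code>b</code> and comparing the printed acceptance rates is one way to apply the rule of thumb above.<br />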
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math> (i.e. it is usually used when <math>\displaystyle f </math> is known up to a proportionality constant: the form of the distribution is known, but the distribution is not exactly known)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
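<br />
The dependence can be seen in a small simulation. The sketch below is an illustration under stated assumptions: the target is a standard normal known up to a constant, the fixed proposal is <math>\displaystyle N(0,2^2)</math>, and the function names are hypothetical. Every rejection repeats the current state, so consecutive values of the chain are often equal.<br />

```python
import math
import random


def target_unnorm(x):
    """Standard normal density up to a constant (assumed target)."""
    return math.exp(-x * x / 2.0)


def proposal_density(x, s=2.0):
    """Density of the fixed N(0, s^2) proposal (assumed proposal)."""
    return math.exp(-x * x / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))


def independence_mh(f, q_density, q_sample, x0, n_steps, rng):
    """Independence Metropolis-Hastings: proposals ignore the current state."""
    x = x0
    chain = [x]
    for _ in range(n_steps):
        y = q_sample(rng)
        # r = min{ f(y)/f(x) * q(x)/q(y), 1 } still depends on the state x
        r = min(1.0, (f(y) * q_density(x)) / (f(x) * q_density(y)))
        if rng.random() < r:
            x = y
        chain.append(x)
    return chain


rng = random.Random(0)
chain = independence_mh(target_unnorm, proposal_density,
                        lambda g: g.gauss(0.0, 2.0), 0.0, 5000, rng)
repeats = sum(1 for a, b in zip(chain, chain[1:]) if a == b)
print("repeated consecutive values:", repeats)
```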
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But, this is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for some constant <math>\displaystyle T > 0 </math> (since the exponential function is monotone increasing, the exponential is largest where <math>\displaystyle h(x) </math> is smallest)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Set <math>\displaystyle i = i + 1</math>, decrease T and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math>; since T is large, this probability is close to 1, so uphill moves are still accepted often<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = 0 </math> and we therefore reject <math>\displaystyle y </math><br />
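<br />
The effect of the temperature on the acceptance probability can be tabulated directly. This is a small sketch assuming a symmetric proposal; the function name <code>sa_acceptance</code> and the values <math>\displaystyle h(x) = 1</math>, <math>\displaystyle h(y) = 2</math> are hypothetical and chosen only for illustration.<br />

```python
import math


def sa_acceptance(h_x, h_y, T):
    """Acceptance probability r = min{ exp((h(x) - h(y)) / T), 1 } for T > 0."""
    if h_y <= h_x:
        return 1.0  # downhill (or equal) moves are always accepted
    return math.exp((h_x - h_y) / T)


# An uphill move from h(x) = 1 to h(y) = 2 at decreasing temperatures
for T in (100.0, 1.0, 0.01):
    print(T, sa_acceptance(1.0, 2.0, T))
```

As T decreases, the probability of accepting the same uphill move falls from nearly 1 to nearly 0, which is why the chain explores freely at first and settles into a minimum at the end.<br />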
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it makes a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, when T is quite small, the distribution that we are sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); <br />
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The procedure of proving this balance equation is similar to what was done in the Metropolis-Hastings proof.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively large correlations between the random variables, because the chain then moves through the space very slowly.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution will be sampled better using only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle X_2 \mid X_1 = x_1 \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle X_1 \mid X_2 = x_2 \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2];<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2);<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2));<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); %condition on the newly drawn X(i,1), not X(i-1,1)<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the points from iteration 1000 onward -> <br />
%this discards the early draws made before convergence occurs <br />
%(this initial period is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
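<br />
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: both proposals <math>\displaystyle q </math> and <math>\displaystyle \tilde{q} </math> are taken to be symmetric normal random walks (so the proposal ratios cancel), the target is an unnormalized bivariate normal with correlation <math>\displaystyle \rho </math>, and all function names are hypothetical.<br />

```python
import math
import random


def bivariate_normal_unnorm(rho):
    """Unnormalized density of a zero-mean, unit-variance bivariate normal."""
    def f(x, y):
        return math.exp(-(x * x - 2.0 * rho * x * y + y * y)
                        / (2.0 * (1.0 - rho * rho)))
    return f


def mh_within_gibbs(f, x0, y0, b, n_steps, rng):
    """Alternate one Metropolis-Hastings step for X and one for Y."""
    x, y = x0, y0
    samples = [(x, y)]
    for _ in range(n_steps):
        # MH step for X, treating Y as fixed
        z = x + b * rng.gauss(0.0, 1.0)
        if rng.random() < min(1.0, f(z, y) / f(x, y)):
            x = z
        # MH step for Y, treating the updated X as fixed
        z = y + b * rng.gauss(0.0, 1.0)
        if rng.random() < min(1.0, f(x, z) / f(x, y)):
            y = z
        samples.append((x, y))
    return samples


rng = random.Random(0)
samples = mh_within_gibbs(bivariate_normal_unnorm(0.5), 0.0, 0.0, 1.0, 5000, rng)
print(samples[-1])
```

Note that the step for Y uses the freshly updated value of X, matching the use of <math>\displaystyle X_{n+1}</math> in the acceptance ratio of the procedure.<br />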
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many important pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N</math> X <math>\displaystyle 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Rank is a relative term, so only its scale needs to be fixed. Summing the rank equation over all pages (assuming every page has at least one outgoing link and <math>\displaystyle d < 1</math>) shows that the ranks satisfy <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). We can use this to solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times \frac{e^T \times P}{N} + d \times L \times D^{-1} \times P \text{ by replacing 1 with } \frac{e^T \times P}{N} </math> <br />
<br />
<math>\displaystyle P = [\frac{1-d}{N} \times e \times e^T + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known, and that each column of A sums to 1)} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
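<br />
The iteration can be sketched in Python. This is a minimal illustration, not a production implementation: it iterates the rank equation <math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math> directly, starts from <math>\displaystyle P_0 = e</math> instead of a random guess, and assumes that every page has at least one outgoing link. The function name <code>pagerank</code>, the value <math>\displaystyle d = 0.85</math> and the three-page link structure are hypothetical.<br />

```python
def pagerank(L, d=0.85, n_iter=200):
    """Iterate P <- (1 - d) e + d L D^{-1} P.

    L[i][j] = 1 if page j links to page i, and 0 otherwise.
    Assumes every page has at least one outgoing link (c_j > 0).
    """
    N = len(L)
    # c[j] = number of outgoing links from page j (column sums of L)
    c = [sum(L[i][j] for i in range(N)) for j in range(N)]
    P = [1.0] * N
    for _ in range(n_iter):
        P = [(1.0 - d) + d * sum(L[i][j] * P[j] / c[j] for j in range(N))
             for i in range(N)]
    return P


# Hypothetical web: page 0 links to pages 1 and 2, page 1 links to 2, page 2 links to 0
L = [[0, 0, 1],
     [1, 0, 0],
     [1, 1, 0]]
P = pagerank(L)
print(P)
```

In this small example page 2 receives links from both other pages and obtains the highest rank, while page 1, which is linked only from a page that splits its vote, obtains the lowest.<br />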
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
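<br />
These definitions translate directly into code. The following is a small sketch using plain Python lists as vectors; the helper names <code>dot</code>, <code>norm</code>, <code>dist</code> and <code>angle</code> are hypothetical.<br />

```python
import math


def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(v):
    """Length of v, the square root of v . v."""
    return math.sqrt(dot(v, v))


def dist(u, v):
    """Distance between u and v, the length of u - v."""
    return norm([ui - vi for ui, vi in zip(u, v)])


def angle(u, v):
    """Angle theta between non-zero u and v, from u . v = |u| |v| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    # clamp to [-1, 1] to guard against rounding just outside the domain of acos
    return math.acos(max(-1.0, min(1.0, cos_theta)))


print(dot([1, 2, 3], [4, 5, 6]), norm([3, 4]), dist([1, 1], [4, 5]))
print(angle([1, 0], [0, 1]))
```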
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Alternatively, a matrix is non-singular if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>,then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>,then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When A has linearly independent columns (so that <math>A^TA</math> is invertible), <math>A^{\dagger} = (A^TA)^{-1}A^T</math> with <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order for n vectors <math>\{u_i\}</math> to form a basis of an n-dimensional space, it is necessary and sufficient that they be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{NxN}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> (the corresponding ''eigenvalue'') such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
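<br />
For a 2 by 2 matrix the characteristic equation is a quadratic and can be solved in closed form. The sketch below assumes a real symmetric matrix, so that both roots are real; the function name <code>eig_symmetric_2x2</code> is hypothetical.<br />

```python
import math


def eig_symmetric_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]].

    The characteristic equation is
        lambda^2 - (a + c) * lambda + (a*c - b*b) = 0.
    """
    trace = a + c
    # (a - c)^2 + 4 b^2 equals trace^2 - 4 * det and is never negative
    root = math.sqrt((a - c) ** 2 + 4.0 * b * b)
    return (trace + root) / 2.0, (trace - root) / 2.0


lam1, lam2 = eig_symmetric_2x2(2.0, 1.0, 2.0)
print(lam1, lam2)   # the two roots of lambda^2 - 4 lambda + 3 = 0
print(lam1 + lam2)  # equals the trace of the matrix
```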
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> onto a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances along the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
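<br />
The decorrelation property can be verified on a small case. The Python sketch below is only an illustration under stated assumptions: the 2 by 2 covariance matrix is a made-up example, and the helper <code>eig_sym_2x2</code> is a hypothetical name that uses the closed-form eigen-decomposition of a symmetric 2 by 2 matrix. It checks that the covariance of the transformed vectors is diagonal, with the eigenvalues on the diagonal.<br />
<br />
```python
import math

def eig_sym_2x2(S):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix.

    Assumes the off-diagonal entry is nonzero.
    """
    a, b, c = S[0][0], S[0][1], S[1][1]
    if b == 0:
        raise ValueError("matrix is already diagonal")
    mid = (a + c) / 2
    rad = math.hypot((a - c) / 2, b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        v = (b, lam - a)               # solves (S - lam*I) v = 0
        n = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / n, v[1] / n)))
    return pairs

def bilinear(u, S, v):
    """Return u^T S v."""
    return sum(u[i] * S[i][j] * v[j] for i in range(2) for j in range(2))

Sigma = [[1.0, 1.5], [1.5, 3.0]]       # illustrative covariance matrix
(lam1, v1), (lam2, v2) = eig_sym_2x2(Sigma)

# Covariance of y = U^T x, where the columns of U are the eigenvectors.
cov_y = [[bilinear(u, Sigma, v) for v in (v1, v2)] for u in (v1, v2)]
print(cov_y)        # off-diagonal entries are zero up to rounding
print(lam1, lam2)   # the diagonal entries equal the eigenvalues
```
<br />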
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in the data<br />
* Preserving the information stored in the data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any scalar <math>c > 0</math>, so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle \textbf{w} </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
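<br />
The identity <math>\textrm{var}(u) = \textbf{w}^T s \textbf{w}</math> can be checked directly on a small sample. The following Python sketch is an illustration only: the five data points and the unit vector <code>w</code> are made up, and the sample covariance is assumed to use the <math>n-1</math> divisor.<br />
<br />
```python
# Made-up 2-D sample and an arbitrary unit direction (0.36 + 0.64 = 1).
X = [(2.0, 0.0), (0.0, 1.0), (3.0, 3.0), (1.0, 2.0), (4.0, 4.0)]
w = (0.6, 0.8)

n = len(X)
mean = [sum(p[j] for p in X) / n for j in range(2)]

# Sample covariance matrix s (divides by n - 1).
s = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in X) / (n - 1)
      for j in range(2)] for i in range(2)]

# Projections u_i = w^T x_i and their sample variance.
u = [w[0] * p[0] + w[1] * p[1] for p in X]
u_bar = sum(u) / n
var_u = sum((ui - u_bar) ** 2 for ui in u) / (n - 1)

# Quadratic form w^T s w.
wsw = sum(w[i] * s[i][j] * w[j] for i in range(2) for j in range(2))

print(var_u, wsw)  # the two numbers agree up to rounding
```
<br />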
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the constrained maximum of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
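<br />
The example can be checked numerically. The Python sketch below (an illustration, not part of the lecture) evaluates the partial derivatives of the Lagrangian <math>L = x - y - \lambda(x^2+y^2-1)</math> at the two candidate points, using the multiplier values <math>\lambda = \pm\sqrt{2}/2</math> obtained by solving the system, and then compares against a brute-force scan of the unit circle. The function names and the grid size are arbitrary choices.<br />
<br />
```python
import math

def f(x, y):
    return x - y

def grad_L(x, y, lam):
    """Partial derivatives of L = x - y - lam*(x^2 + y^2 - 1) w.r.t. x, y, lam."""
    return (1 - 2 * lam * x, -1 - 2 * lam * y, -(x * x + y * y - 1))

r = math.sqrt(2) / 2
candidates = [(r, -r, r), (-r, r, -r)]   # (x, y, lambda)
for x, y, lam in candidates:
    # All three partial derivatives vanish (up to rounding) at both points.
    print((x, y), grad_L(x, y, lam), f(x, y))

# Brute-force check: evaluate f on a grid of points on the unit circle.
N = 8000
best = max(f(math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N))
           for k in range(N))
print(best, math.sqrt(2))
```
<br />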
<br />
====Determining '''W''' ====<br />
Back to the original problem, we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (<math> \textbf{w}^T \textbf{w} = 1 </math>), the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>.<br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
The covariance matrix of D-dimensional data has D eigenvalues, which we order as<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
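<br />
Both facts (the largest eigenvalue is the maximum of <math>\textbf{w}^T S \textbf{w}</math> over unit vectors, and the eigenvalues sum to the trace) can be checked on a small case. The Python sketch below is illustrative only: the 2 by 2 matrix <code>S</code> contains made-up values, and the maximum is found by a simple scan over directions rather than by an eigen-solver.<br />
<br />
```python
import math

S = [[2.0, 1.2], [1.2, 1.0]]  # illustrative covariance matrix (made-up values)

def quad(w):
    """Return w^T S w."""
    return sum(w[i] * S[i][j] * w[j] for i in range(2) for j in range(2))

# Scan unit vectors w = (cos t, sin t); t in [0, pi) covers every direction.
steps = 10000
best_t = max((k * math.pi / steps for k in range(steps)),
             key=lambda t: quad((math.cos(t), math.sin(t))))
w_star = (math.cos(best_t), math.sin(best_t))
max_var = quad(w_star)

# Closed-form eigenvalues of a symmetric 2x2 matrix.
mid = (S[0][0] + S[1][1]) / 2
rad = math.hypot((S[0][0] - S[1][1]) / 2, S[0][1])
lam1, lam2 = mid + rad, mid - rad

print(max_var, lam1)                    # maximum variance vs. largest eigenvalue
print(lam1 + lam2, S[0][0] + S[1][1])   # sum of eigenvalues vs. trace
```
<br />
The scanned maximizer <code>w_star</code> also satisfies <math>S\textbf{w} \approx \lambda_1 \textbf{w}</math>, up to the resolution of the scan.<br />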
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat') % loads the data matrix X (one image per column)<br />
 who                                       % list the variables that were loaded<br />
 size(X)                                   % number of pixels by number of images<br />
 imagesc(reshape(X(:,1),20,28)')           % display the first noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);                         % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and learn to distinguish the labeled types) or cluster it (i.e. leave the data unlabeled and look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
The Matlab code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs , scores ] = princomp (X') <br />
% performs principal components analysis on the data matrix X<br />
% returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
[[File:Plot1.jpg]]<br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
[[File:Plot2.jpg]]<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
% the 'ro' command makes them red Os on the plot<br />
[[File:Plot3.jpg]]<br />
 % If we classify based on the position in this plot (feature), <br />
 % it's easier than looking at each of the 64 data pieces<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
[[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y = U^T X_{D \times n} </math>, where <math>\, U</math> is the <math>D \times d</math> matrix of the top d eigenvectors.<br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X} = UY_{d \times n} </math>.<br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
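<br />
The centering, encoding and reconstruction steps in the notes above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm from the slide: the ten 2-D points are made up, <math>D = 2</math> and <math>d = 1</math>, the sample covariance uses the <math>n-1</math> divisor, and the top eigenvector is obtained from the closed form for a symmetric 2 by 2 matrix.<br />
<br />
```python
import math

# Made-up data: n = 10 points in D = 2 dimensions.
X = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
     (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(X)

# Step 1: subtract the mean so the data have mean 0.
mean = [sum(p[j] for p in X) / n for j in range(2)]
Xc = [(p[0] - mean[0], p[1] - mean[1]) for p in X]

# Step 2: sample covariance and its top eigenvector u (d = 1).
S = [[sum(p[i] * p[j] for p in Xc) / (n - 1) for j in range(2)] for i in range(2)]
a, b, c = S[0][0], S[0][1], S[1][1]
rad = math.hypot((a - c) / 2, b)
lam1, lam2 = (a + c) / 2 + rad, (a + c) / 2 - rad
u = (b, lam1 - a)                      # eigenvector for lam1 (valid since b != 0 here)
norm_u = math.hypot(u[0], u[1])
u = (u[0] / norm_u, u[1] / norm_u)

# Step 3: encode, Y = U^T X (one coefficient per point).
Y = [u[0] * p[0] + u[1] * p[1] for p in Xc]

# Step 4: reconstruct, X_hat = U Y, then add the mean back.
X_hat = [(mean[0] + u[0] * y, mean[1] + u[1] * y) for y in Y]

# Total squared reconstruction error equals (n - 1) times the discarded eigenvalue.
err = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(X, X_hat))
print(err, (n - 1) * lam2)
```
<br />
The last line illustrates why discarding low-variance directions loses little: the reconstruction error is exactly the variance carried by the discarded component.<br />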
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower dimensional space. The difference is that we are not interested in maximizing variances. Rather our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to, depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class are close together (ideally collapsing to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2 </math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within classes covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar. Therefore,<br />
<br /><br /><br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two class problem the between class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
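<br />
The identity <math>(w^T \mu_1 - w^T \mu_2)^2 = w^Ts_Bw</math> is easy to confirm numerically. In the Python sketch below the two means and the direction <code>w</code> are arbitrary illustrative values, not data from the lecture.<br />
<br />
```python
mu1 = [1.0, 1.0]   # illustrative class means
mu2 = [5.0, 3.0]
w = [0.6, -0.8]    # arbitrary direction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Left-hand side: squared distance between the projected means.
lhs = (dot(w, mu1) - dot(w, mu2)) ** 2

# Right-hand side: w^T s_B w with s_B = (mu1 - mu2)(mu1 - mu2)^T.
d = [a - b for a, b in zip(mu1, mu2)]
s_B = [[d[i] * d[j] for j in range(2)] for i in range(2)]
rhs = sum(w[i] * s_B[i][j] * w[j] for i in range(2) for j in range(2))

print(lhs, rhs)  # equal up to rounding
```
<br />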
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a well-known problem called the "generalized eigenvalue problem". We can solve it using Lagrange multipliers. Since w only defines a direction, we do not care about its magnitude. Therefore we solve a problem similar to that in PCA,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br /><br />
subject to <math>\displaystyle w^Ts_ww=1</math> <br />
<br /><br /><br />
We solve the following Lagrange Multiplier problem,<br />
<br /><br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br /><br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, our Optimization problem for FDA is:<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />Subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
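Using Lagrange multipliers, we form the Lagrangian <br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww-1) </math><br />
and set its derivative with respect to w to zero, which gives <math>\displaystyle s_Bw = \lambda s_ww</math>. Therefore:<br />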
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified, such that:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
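<br />
The two-class solution can be sketched numerically. The Python code below is an illustration under stated assumptions: the means and the common class covariance are example values, the helper names <code>solve_2x2</code> and <code>fisher_ratio</code> are hypothetical, and the optimality check is a coarse scan over directions rather than a proof.<br />
<br />
```python
import math

mu1, mu2 = [1.0, 1.0], [5.0, 3.0]        # illustrative class means
Sigma = [[1.0, 1.5], [1.5, 3.0]]          # illustrative common class covariance
s_w = [[2 * v for v in row] for row in Sigma]   # s_w = Sigma_1 + Sigma_2
d = [b - a for a, b in zip(mu1, mu2)]           # mu_2 - mu_1

def solve_2x2(A, b):
    """Solve A z = b for a 2x2 matrix A by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def fisher_ratio(w):
    """(w^T s_B w) / (w^T s_w w) with s_B = d d^T."""
    between = (w[0] * d[0] + w[1] * d[1]) ** 2
    within = sum(w[i] * s_w[i][j] * w[j] for i in range(2) for j in range(2))
    return between / within

w = solve_2x2(s_w, d)                     # w = s_w^{-1} (mu_2 - mu_1)
best_scanned = max(fisher_ratio((math.cos(k * math.pi / 1000),
                                 math.sin(k * math.pi / 1000)))
                   for k in range(1000))

print(w)                                  # the Fisher direction (any rescaling works too)
print(fisher_ratio(w), best_scanned)      # no scanned direction does better
```
<br />
Because the ratio is unchanged when w is rescaled, only the direction of <code>w</code> matters.<br />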
<br />
=====Example:=====<br />
<br />
We now see an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two bivariate normal distributions to show the difference between the two methods. <br />
%First of all, we generate the two data set:<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300) <br />
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300) <br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
 X=[X1;X2]   % stack the two samples: 600 observations by 2 variables<br />
 X=X'        % 2 by 600, one observation per column<br />
 [coefs, scores]=princomp(X');<br />
 coefs(:,1)  %first principal component<br />
<br />
ans =<br />
0.76355476446932<br />
0.64574307712603<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=inv(sw)*[4 2]' % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
 %We see in the above picture that there is no overlapping<br />
Xp=coefs(:,1)'*X<br />
figure<br />
 plot(Xp(1:300),1,'ob')<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
 %In this case there is an overlapping since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we might wish to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (When Y is a continuous variable, the prediction problem is one of regression rather than classification.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we estimate the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points. This gives the empirical error rate:<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (Y_i \neq h(x_i)) </math><br />
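<br />
As a small illustration of the empirical error rate, the Python sketch below counts misclassified points for a toy classifier. The threshold classifier <code>h</code> and the six labeled test points are made up for this example.<br />
<br />
```python
def error_rate(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) differs from y_i."""
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / len(xs)

def h(x):
    """Hypothetical classifier: predict class 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0

xs = [0.1, 0.4, 0.6, 0.9, 0.45, 0.7]   # made-up test inputs
ys = [0,   0,   1,   1,   1,    0]     # their true labels

print(error_rate(h, xs, ys))  # two of the six points are misclassified
```
<br />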
<br />
<br />
=== Bayes Classification Rule ===<br />
<br />
Considering the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math><br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classifier function is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
This function is considered the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions to calculate <math>\, r(x)</math>. This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.<br />
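<br />
When the distributions are known, the rule is straightforward to compute. The Python sketch below is purely illustrative: it assumes (hypothetically) that the two class conditional densities are 1-D Gaussians with means 0 and 2, unit variance, and equal priors, none of which comes from the lecture.<br />
<br />
```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def r(x, prior1=0.5, mu0=0.0, mu1=2.0, sigma=1.0):
    """Posterior Pr(Y = 1 | X = x) by Bayes' rule (all parameters are assumptions)."""
    num = normal_pdf(x, mu1, sigma) * prior1
    den = num + normal_pdf(x, mu0, sigma) * (1 - prior1)
    return num / den

def h(x):
    """Bayes classifier: predict 1 when the posterior is at least 1/2."""
    return 1 if r(x) >= 0.5 else 0

for x in (0.5, 1.0, 1.5):
    print(x, r(x), h(x))
```
<br />
With equal priors and equal variances the posterior crosses 1/2 at the midpoint of the two means, so points to the left of 1 are assigned to class 0 and points to the right to class 1.<br />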
<br />
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, the decision boundary consists of the points at which the posterior probabilities of the two classes are equal. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a linear decision boundary while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
We would like to apply the Bayes classification rule by approximating the class conditional density <math>f_k(x)</math> and the prior <math>\pi_k</math> where,<br /><br /><br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{l}f_l(x)\pi_l}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional density is multivariate gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>(x + a)^TA(x+b) = x^TAx + x^TAb + a^TAx + a^TAb</math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\Sigma_k</math> is the class covariance matrix and <math>\mu_k</math> is the class mean. By definition of the decision boundary,<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is a linear function of <math>x</math> of the form <math>a^Tx + b = 0</math>.<br />
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|) </math><br />
<br />
<br />
<br />
To classify a point, x, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class k if <math>\, \delta_k > \delta_l </math> and vice versa. <br /><br /><br />
<br />
:<math><br />
h(x) = \begin{cases}<br />
k, & \text{if } \delta_k > \delta_l \\<br />
l, & otherwise\\<br />
\end{cases}</math><br />
<br />
(note: since <math> - \frac{d}{2}log(2\pi) </math> is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.) <br />
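<br />
The computational method can be sketched as follows in Python. This is an illustration under stated assumptions rather than the course implementation: the two class means, the common covariance matrix and the equal priors are example values, the classes are labeled 1 and 2 in place of k and l, and the constant <math>- \frac{d}{2}log(2\pi)</math> is dropped as the note above permits.<br />
<br />
```python
import math

Sigma = [[1.0, 1.5], [1.5, 3.0]]         # illustrative common covariance matrix
mus = {1: [1.0, 1.0], 2: [5.0, 3.0]}     # illustrative class means
priors = {1: 0.5, 2: 0.5}                # assumed equal priors

# Inverse and determinant of the 2x2 covariance matrix.
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
Sigma_inv = [[ Sigma[1][1] / det, -Sigma[0][1] / det],
             [-Sigma[1][0] / det,  Sigma[0][0] / det]]

def delta(x, k):
    """log(f_k(x) * pi_k) without the constant -(d/2) log(2 pi)."""
    diff = [x[0] - mus[k][0], x[1] - mus[k][1]]
    maha = sum(diff[i] * Sigma_inv[i][j] * diff[j]
               for i in range(2) for j in range(2))
    return math.log(priors[k]) - 0.5 * maha - 0.5 * math.log(det)

def classify(x):
    """Return class 1 if delta_1 > delta_2, otherwise class 2."""
    return 1 if delta(x, 1) > delta(x, 2) else 2

for x in ([1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 3.0]):
    print(x, classify(x))
```
<br />
Since both classes share the same covariance here, the <math>\frac{1}{2}log(|\Sigma|)</math> term is also identical for the two deltas and does not affect the comparison.<br />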
<br />
Consider a special case: <math>\, \Sigma_k = I </math>, the identity matrix.<br />
<br />
<math></div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3439stat341 / CM 3612009-07-24T02:31:49Z<p>Hclam: /* Linear Discriminant Analysis(LDA) - July 23 */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. A computer is deterministic, so instead of truly random numbers we generate '''pseudo-random''' numbers: a deterministic sequence whose values appear to be uniformly distributed. Outside a computational setting, producing uniform random numbers is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The general recurrence above, with both an additive term ''b'' and a multiplicative term ''a'', is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator'''; when ''b'' = 0, as in this example, it is purely multiplicative. For correctly chosen values of ''a'', ''b'', and ''m'', this method generates a sequence that visits every integer between 0 and ''m'' - 1 (or between 1 and ''m'' - 1 when ''b'' = 0). Dividing the terms of the resulting sequence by ''m - 1'' gives a sequence of numbers between 0 and 1, which resembles sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and poorly chosen values are not suitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and a period equal to ''m'' - 1. In practice, Park and Miller (1988) found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the largest value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
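The recurrence <math>x_{i+1} = (ax_{i} + b) \mod{m}</math> and the idea of a period can be sketched in a few lines of Python. This is only an illustration; the function names <code>lcg</code> and <code>cycle_length</code> are hypothetical and not from the lecture.<br />

```python
def lcg(a, b, m, seed, n):
    """Return the first n values x_1, ..., x_n of x_{i+1} = (a*x_i + b) mod m."""
    values = []
    x = seed
    for _ in range(n):
        x = (a * x + b) % m
        values.append(x)
    return values


def cycle_length(a, b, m, seed):
    """Length of the cycle that the sequence eventually falls into."""
    seen = {}          # value -> step at which it was first produced
    x = seed
    step = 0
    while x not in seen:
        seen[x] = step
        x = (a * x + b) % m
        step += 1
    return step - seen[x]


if __name__ == "__main__":
    print(lcg(13, 0, 31, 1, 3))        # first three terms of the first example
    print(cycle_length(13, 0, 31, 1))  # period of the first example
    print(cycle_length(3, 2, 4, 1))    # period of the second (bad) example
```

For the two examples above, the first parameter set cycles through all of 1, ..., 30 before repeating, while the second is stuck at 1.<br />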
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that applying the inverse of the cumulative distribution function (cdf) of some distribution to a random sample from the uniform distribution yields a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then the cdf of the random variable X is F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F where f is defined as 0 outside of its domain, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is monotonically increasing: for <math>a \leq b</math> we have <math>P(X \leq a) \leq P(X \leq b)</math>.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta t} dt = 1 - e^{-\theta x} </math><br />
<br />
:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as <math>Unif[0, 1]</math>, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1\dots x_n</math> from <math>F(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
[[Image:HistRandNum.jpg|center|300px|"Histogram showing the expected exponential distribution" ]]<br />
<br />
<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions the inverse of <math>F(x)</math> is too difficult (or impossible) to obtain. Further, for some distributions <math>F(x)</math> itself has no closed-form expression; even when one exists, it is usually difficult to find <math>F^{-1}</math>.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (the right-hand inequalities are non-strict so that the case <math>U = 1</math> is not missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
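The discrete procedure amounts to finding the first cumulative probability that is at least <math>U</math>. A minimal Python sketch, assuming the values and probabilities are given as two lists (the function name <code>sample_discrete</code> and its optional <code>u</code> argument are hypothetical, added so that the lookup can be checked for a fixed <math>U</math>):<br />

```python
import random
from bisect import bisect_left
from itertools import accumulate


def sample_discrete(values, probs, u=None):
    """Inverse transform for a discrete distribution.

    Returns values[k] for the smallest k with p_0 + ... + p_k >= u.
    If u is not supplied it is drawn from Unif[0, 1).
    """
    cumulative = list(accumulate(probs))     # p_0, p_0 + p_1, ...
    if u is None:
        u = random.random()
    k = bisect_left(cumulative, u)           # first index with cumulative[k] >= u
    # Guard against rounding that leaves the last cumulative sum just below 1.
    return values[min(k, len(values) - 1)]


if __name__ == "__main__":
    draws = [sample_discrete([0, 1, 2], [0.3, 0.2, 0.5]) for _ in range(10000)]
    print([draws.count(v) / len(draws) for v in (0, 1, 2)])
```

Boundary cases such as <math>U = p_0</math> occur with probability zero, so whether an inequality is strict does not affect the distribution of the output.<br />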
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000);             % preallocate the sample<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2))    % cumulative probability P(X <= 1) = 0.5<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />[[Image:fxcgx.JPG|thumb|right|300px|"Graph of the pdf of <math>f(x)</math> (target distribution) and <math> c g(x)</math> (proposal distribution)"]]<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant c such that,<br /><br /><br />
<math> c \cdot g(x) \geq f(x)\ \forall x</math><br />
<br /><br /><br />
we can draw candidates from <math>g(x)</math> and accept each candidate x with probability<br />
<br /><br /><br />
<math> \frac {f(x)}{c \cdot g(x)} </math>,<br />
<br /><br /><br />
rejecting it otherwise. The accepted candidates form a sample that follows the target distribution <math>f(x)</math>: candidates are usually accepted where the ratio is close to 1 and usually rejected where it is close to 0.<br />
<br />
<br />
<br />
At x=7 in the figure, where the ratio is close to 1, a candidate is almost always accepted.<br />
<br />
At x=9, where the ratio is smaller, a candidate is accepted only with probability <math> \frac {f(x)}{c \cdot g(x)} </math> and rejected otherwise.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
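The procedure can be sketched as a generic Python function. This is an illustration under stated assumptions: the caller supplies the target density, the proposal density, a sampler for the proposal, and a valid constant c; the name <code>accept_reject</code> and its parameters are hypothetical.<br />

```python
import random


def accept_reject(f, g_pdf, g_sample, c, n):
    """Draw n samples from density f using proposals from g.

    f        -- target density, callable
    g_pdf    -- proposal density, callable
    g_sample -- zero-argument function returning one draw from g
    c        -- constant with c * g_pdf(x) >= f(x) for all x
    """
    samples = []
    while len(samples) < n:
        y = g_sample()                       # step 1: candidate from g
        u = random.random()                  # step 2: U ~ Unif[0, 1)
        if u <= f(y) / (c * g_pdf(y)):       # step 3: accept with prob f(y)/(c g(y))
            samples.append(y)
    return samples


if __name__ == "__main__":
    # Target f(x) = 2x on [0, 1], uniform proposal, c = 2.
    xs = accept_reject(lambda x: 2 * x, lambda x: 1.0, random.random, 2.0, 10000)
    print(sum(xs) / len(xs))                 # sample mean; the target has mean 2/3
```

By the proof above, each candidate is accepted with overall probability 1/c, so on average c candidates are needed per accepted sample.<br />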
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) = 2 </math> here. Taking <math>c</math> to be this maximum gives the smallest constant that satisfies <math> c \cdot g(x) \geq f(x) </math>, and therefore the highest acceptance probability <math>1/c</math>.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
 j = 1;<br />
 while j <= 1000              % collect 1000 accepted samples<br />
   y = rand;                  % candidate from g = Unif[0,1]<br />
   u = rand;<br />
   if u <= y                  % accept with probability f(y)/(c*g(y)) = y<br />
       x(j) = y;<br />
       j = j + 1;<br />
   end<br />
 end<br />
 hist(x);<br />
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
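The discrete example can be sketched in Python as follows. This is an illustrative sketch rather than code from the lecture; the names <code>PMF</code> and <code>sample_pmf_by_rejection</code> are hypothetical, and the proposal is assumed to be uniform on the support of the target.<br />

```python
import random

# Target pmf from the example above.
PMF = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}


def sample_pmf_by_rejection(pmf):
    """Draw one value from pmf using a discrete uniform proposal on its support."""
    support = list(pmf)
    g = 1.0 / len(support)              # proposal probability of each value
    c = max(pmf.values()) / g           # smallest valid constant, 2.2 here
    while True:
        y = random.choice(support)      # candidate Y ~ Unif{1, 2, 3, 4}
        u = random.random()
        if u <= pmf[y] / (c * g):       # accept with probability f(Y) / (c g(Y))
            return y


if __name__ == "__main__":
    draws = [sample_pmf_by_rejection(PMF) for _ in range(10000)]
    print({v: draws.count(v) / len(draws) for v in PMF})
```

The observed frequencies should be close to the target probabilities; the most likely value, 2, is always accepted when proposed.<br />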
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \sum_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
This suggests that to sample from the Gamma distribution, we can sample t independent exponential random variables with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \sum_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far: for example, the inverse of F(Z) is not easy to obtain. We may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point on the Cartesian plane, r is the distance from that point to the pole (origin), and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi} \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
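<br />
As a sketch of why this holds (a standard change-of-variables argument, not taken from the lecture), write <math>x = \sqrt{d}\cos(\theta)</math> and <math>y = \sqrt{d}\sin(\theta)</math> with <math>d = r^2</math>. The density in terms of <math>d</math> and <math>\theta</math> is the density of <math>(x,y)</math> multiplied by the absolute value of the Jacobian determinant:<br />
<br />
:<math>\begin{align} \left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| &{}= \left|\det\begin{pmatrix} \frac{\cos\theta}{2\sqrt{d}} & -\sqrt{d}\sin\theta \\ \frac{\sin\theta}{2\sqrt{d}} & \sqrt{d}\cos\theta \end{pmatrix}\right| = \frac{1}{2}\cos^2\theta + \frac{1}{2}\sin^2\theta = \frac{1}{2} \\ f(d,\theta) &{}= f(x,y)\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| = \frac{1}{2\pi}e^{-\frac{d}{2}} \cdot \frac{1}{2} \end{align}</math><br />
<br />
The factor <math>\frac{1}{2}e^{-\frac{d}{2}}</math> is the Exponential density with rate 1/2 for <math>d \geq 0</math>, and <math>\frac{1}{2\pi}</math> is the Uniform density on <math>[0, 2\pi]</math>.<br />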
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independent} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right] </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta = \mu</math> is Gaussian (with <math>\sigma</math> known):<br />
:<math>P(\mu) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\mu|X^N) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square we conclude that the posterior is Gaussian as well:<br />
:<math>f(\mu|X^N)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
The posterior mean differs from the MLE: it is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
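The weighted-average form of the posterior mean can be checked numerically with a small Python sketch. The function name <code>posterior_mean_var</code> is hypothetical, and the posterior variance it also returns, <math>(N/\sigma^2 + 1/\tau^2)^{-1}</math>, is the standard conjugate-normal result rather than something stated in the notes above.<br />

```python
def posterior_mean_var(xs, sigma, mu0, tau):
    """Posterior mean and variance of mu for Gaussian data with known sigma.

    xs    -- observed data x_1, ..., x_N (may be empty)
    sigma -- known standard deviation of the data
    mu0   -- prior mean of mu
    tau   -- prior standard deviation of mu
    """
    n = len(xs)
    data_precision = n / sigma ** 2          # weight on the sample mean
    prior_precision = 1.0 / tau ** 2         # weight on the prior mean
    xbar = sum(xs) / n if n > 0 else 0.0     # unused (weight 0) when N = 0
    total = data_precision + prior_precision
    mean = (data_precision * xbar + prior_precision * mu0) / total
    return mean, 1.0 / total


if __name__ == "__main__":
    print(posterior_mean_var([2.0, 4.0], sigma=1.0, mu0=0.0, tau=1.0))
    print(posterior_mean_var([], sigma=1.0, mu0=5.0, tau=2.0))
```

With no data the function returns the prior mean, and as N grows the weight on the sample mean approaches 1, matching the two limits noted above.<br />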
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
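The process, together with the standard error and confidence interval, can be sketched in Python. This is a minimal illustration; the function name <code>mc_integrate</code> is hypothetical, and the default z = 1.96 assumes a 95% interval.<br />

```python
import math
import random


def mc_integrate(h, a, b, n, z=1.96):
    """Monte Carlo estimate of the integral of h over [a, b].

    Returns (estimate, standard_error, (ci_low, ci_high)); needs n >= 2.
    """
    # Y_i = w(x_i) = h(x_i) * (b - a) with x_i ~ Unif(a, b).
    ys = [h(random.uniform(a, b)) * (b - a) for _ in range(n)]
    estimate = sum(ys) / n
    s2 = sum((y - estimate) ** 2 for y in ys) / (n - 1)   # sample variance
    se = math.sqrt(s2 / n)                                # SE = s / sqrt(N)
    return estimate, se, (estimate - z * se, estimate + z * se)


if __name__ == "__main__":
    est, se, ci = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 10000)
    print(est, se, ci)    # the exact integral of x^3 over [0, 1] is 1/4
```

The standard error shrinks like <math>1/\sqrt{N}</math>, so halving the width of the interval takes about four times as many samples.<br />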
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
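# We need to draw <math>x_1, ..., x_N</math> from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math> (for example by the Inverse Transform Method)<br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math>, that is, <math>w(x) = \sqrt{x}</math><br />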
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 u=rand(100,1);<br />
 x=-log(u);                  % inverse transform: x ~ Exp(1)<br />
 h=x.^ .5;                   % h = sqrt(x)<br />
 mean(h)<br />
 %The value obtained using the Monte Carlo method<br />
 F = @(x) sqrt(x).*exp(-x);<br />
 quad(F,0,50)<br />
 %The value of the integral computed numerically by Matlab<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# Exact value: <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
 x = rand(1000,1);<br />
 mean(x.^3)<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite interval,<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute in <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>; as x runs from 5 to <math>\infty</math>, y runs from 1 to 0<br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
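As a numerical check of this example, the following Python sketch uses the substitution <math>y = \frac{1}{x-4}</math> with the integrand on (0, 1) taken as <math>\frac{3e^{-(1/y+4)}}{y^2}</math>, which is positive, as it must be since the exact value <math>3e^{-5}</math> is positive. The function name <code>estimate_tail_integral</code> is hypothetical.<br />

```python
import math
import random


def estimate_tail_integral(n):
    """Monte Carlo estimate of the integral of 3*exp(-x) over [5, infinity)."""
    total = 0.0
    for _ in range(n):
        y = 1.0 - random.random()          # y in (0, 1], avoids division by zero
        total += 3.0 * math.exp(-(1.0 / y + 4.0)) / (y * y)
    return total / n


if __name__ == "__main__":
    print(estimate_tail_integral(100000))  # Monte Carlo estimate
    print(3.0 * math.exp(-5.0))            # exact value of the integral
```

Because the transformed integrand is bounded on (0, 1), the estimator has small variance and a moderate number of samples already gives a close estimate.<br />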
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of the probability 'p' of a Binomial is proportional to the likelihood <math>p^x(1-p)^{n-x}</math>, which is the kernel of a Beta distribution. So let<br><br />
:<math>\displaystyle f(p_1 | X) \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle f(p_2 | Y) \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> for each sample in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{i=1}^{N}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
 n=100; %number of trials for X<br />
 m=100; %number of trials for Y<br />
 x=80; %number of successes in the X trials<br />
 y=60; %number of successes in the Y trials<br />
 p1=betarnd(x+1, n-x+1, 1, 1000); %1000 posterior draws of p1<br />
 p2=betarnd(y+1, m-y+1, 1, 1000); %1000 posterior draws of p2<br />
 delta=p1-p2;<br />
 mean(delta) %no semicolon, so the estimate is displayed<br />
<br />
One run of this code gave a mean of 0.1938 (the value varies slightly from run to run).<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\text{ CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
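<br />
The same simulation can be sketched in Python using only the standard library. This is a minimal illustration rather than the lecture's code: the function name <code>posterior_delta_samples</code>, the seed and the number of draws are assumptions chosen here, and the counts are the ones used in the Matlab example.<br />

```python
import random
import statistics

def posterior_delta_samples(x, n, y, m, draws=10000, seed=0):
    """Draw delta = p1 - p2 from independent Beta posteriors (flat priors).

    p1 | x ~ Beta(x + 1, n - x + 1) and p2 | y ~ Beta(y + 1, m - y + 1).
    """
    rng = random.Random(seed)
    return [
        rng.betavariate(x + 1, n - x + 1) - rng.betavariate(y + 1, m - y + 1)
        for _ in range(draws)
    ]

deltas = posterior_delta_samples(x=80, n=100, y=60, m=100)
delta_hat = statistics.fmean(deltas)

# Equal-tailed 95% interval: with n=40 the cut points sit at 2.5%, 5%, ..., 97.5%.
cuts = statistics.quantiles(deltas, n=40)
ci_low, ci_high = cuts[0], cuts[-1]
print(delta_hat, (ci_low, ci_high))
```

The estimate should land near the analytic value <math>\frac{x+1}{n+2} - \frac{y+1}{m+2}</math>, with the exact digits depending on the seed.<br />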
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
The procedure is as follows:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, \dots, 10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>; otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta = x_j</math>, where <math>\displaystyle j</math> is the smallest index <math>\displaystyle i</math> with <math>\displaystyle p_i \geq 0.5</math>.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die, <math>Y_i</math>, is observed, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
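<br />
The rejection step of this process can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the data are hypothetical (four dose levels instead of ten, to keep the rejection step cheap, with made-up death counts), the function name <code>dose_response_draws</code> is invented, and draws in which no <math>p_i</math> reaches 0.5 are simply skipped, a case the lecture does not address.<br />

```python
import random

def dose_response_draws(doses, deaths, n, draws, seed=0):
    """Sample delta = x_j, j = min{i : p_i >= 0.5}, subject to p_1 <= ... <= p_k."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < draws:
        # Independent Beta(y_i + 1, n - y_i + 1) draws, one per dose level.
        p = [rng.betavariate(y + 1, n - y + 1) for y in deaths]
        # Reject the whole vector unless the death probabilities are ordered.
        if any(a > b for a, b in zip(p, p[1:])):
            continue
        first = next((i for i, pi in enumerate(p) if pi >= 0.5), None)
        if first is None:
            continue  # assumption: skip draws where no dose reaches 50%
        kept.append(doses[first])
    return kept

doses = [1.0, 2.0, 3.0, 4.0]   # hypothetical dose levels x_1 < ... < x_4
deaths = [1, 4, 6, 9]          # hypothetical deaths out of n = 10 rats per dose
delta_draws = dose_response_draws(doses, deaths, n=10, draws=2000)
delta_bar = sum(delta_draws) / len(delta_draws)
print(delta_bar)
```

With ten dose levels and heavily overlapping posteriors the acceptance rate of the ordering check can become very small, which is why this sketch uses fewer levels.<br />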
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the proposal distribution to sample from and g(x) is the target function to integrate. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)\,dx = E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x))</math>, the expectation of <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math> with respect to g(x),<br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> that are closer to the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math>, which is the same as applying regular Monte Carlo simulation to <math>\displaystyle h(x)</math> with samples drawn from <math>\displaystyle g</math>, and weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that points that are more likely under <math>\displaystyle f</math> than under <math>\displaystyle g</math> count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of the <math> h(x_i) </math> instead of a plain sum.<br />
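<br />
A minimal Python sketch of this weighted average is given below. The concrete choices are assumptions made for illustration and are not from the lecture: the target density <math>f(x) = 2x</math> on <math>[0,1]</math>, the proposal <math>g = Unif(0,1)</math>, and <math>h(x) = x</math>, for which the exact answer is <math>\int_0^1 x \cdot 2x\,dx = \frac{2}{3}</math>.<br />

```python
import random

def importance_sampling(h, f, g_pdf, g_sample, n, rng):
    """Estimate I = integral of h(x) f(x) dx by averaging w(x_i) = h(x_i) f(x_i) / g(x_i)."""
    total = 0.0
    for _ in range(n):
        x = g_sample(rng)                 # x_i drawn from the proposal g
        total += h(x) * f(x) / g_pdf(x)   # weight B(x_i) = f/g times h(x_i)
    return total / n

rng = random.Random(0)
estimate = importance_sampling(
    h=lambda x: x,                  # function whose expectation under f we want
    f=lambda x: 2.0 * x,            # hypothetical target density on [0, 1]
    g_pdf=lambda x: 1.0,            # Unif(0, 1) proposal density
    g_sample=lambda r: r.random(),  # sampler for the proposal
    n=20000,
    rng=rng,
)
print(estimate)  # should be close to the exact value 2/3
```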
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle \beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen <math>\displaystyle g</math>, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math> and <math>\displaystyle x_i \sim g</math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x)</math> does not vanish as quickly, then <math>\displaystyle E[(w(x))^2]</math> may be infinite. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, since then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can become arbitrarily large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
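<br />
As a hedged worked illustration (an example chosen here, not taken from the lecture), let <math>\displaystyle h \equiv 1</math>, let <math>\displaystyle f</math> be the <math>\displaystyle N(0,1)</math> density and let <math>\displaystyle g</math> be the <math>\displaystyle N(0,\sigma^2)</math> density. Then<br />
::<math>\displaystyle E_g[(w(x))^2] = \int \frac{f^2(x)}{g(x)}\,dx = \frac{\sigma}{\sqrt{2\pi}}\int \exp\left(-x^2\left(1-\frac{1}{2\sigma^2}\right)\right)dx = \frac{\sigma^2}{\sqrt{2\sigma^2-1}}</math><br />
provided <math>\displaystyle \sigma^2 > \frac{1}{2}</math>; for <math>\displaystyle \sigma^2 \leq \frac{1}{2}</math> the integral diverges. So a proposal whose tail is too thin relative to <math>\displaystyle f</math> gives an infinite second moment, while <math>\displaystyle \sigma = 1</math> (that is, <math>\displaystyle g = f</math>) gives exactly 1.<br />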
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and so cannot be negative. Analytically, we can find the <math>\displaystyle g</math> that minimizes the variance (see the theorem below).<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x)\,dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
======Aside: Jensen's Inequality======<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>; here <math>\displaystyle g</math> denotes a generic convex function, not the proposal density), then for any <math>\displaystyle \alpha \in [0,1]</math>, <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
Essentially, the definition of convexity says that the line segment between two points on a curve lies on or above the curve. This can be generalized to any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math> and each <math>\displaystyle \alpha_i \geq 0</math><br />
::::::::::=======================================================<br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g^*(x) </math> for <math>\displaystyle g(x)</math> in <math>\displaystyle E[(w(x))^2]</math>, we can see that the lower bound is attained, as required. Details omitted.<br /><br />
<br />
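However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the kind of integral we are trying to estimate.<br />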
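<br />
The substitution of <math>\displaystyle g^*</math> can be sketched as follows, considering only points where <math>\displaystyle h(x) \neq 0</math> (elsewhere <math>\displaystyle g^*(x) = 0</math>, so such points are never sampled). Write <math>\displaystyle c = \int \left| h(s) \right| f(s) ds</math>, so that <math>\displaystyle g^*(x) = \left| h(x) \right| f(x) / c</math>. Then<br />
::<math>\displaystyle w(x) = \frac{h(x)f(x)}{g^*(x)} = c\,\frac{h(x)}{\left| h(x) \right|}, \qquad (w(x))^2 = c^2</math><br />
::<math>\displaystyle E_{g^*}[(w(x))^2] = \int c^2 g^*(x)\,dx = c^2 = \left(\int \left| h(x) \right| f(x)\,dx\right)^2</math><br />
which equals the lower bound. In particular, if <math>\displaystyle h \geq 0</math> then <math>\displaystyle w(x) = c = I</math> is constant and the variance is zero.<br />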
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim g</math><br />
Since <math>\displaystyle f</math> is a density, <math>\displaystyle \int f(x)\,dx = 1</math>, so we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful, especially when f is known only up to a proportionality constant, because that constant cancels in <math>\displaystyle b'</math>. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
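<br />
The normalized-weight form can be sketched in Python as below. The setup is an assumption chosen for illustration: the target is a standard normal known only through its unnormalized kernel <math>e^{-x^2/2}</math>, the proposal is <math>N(0, 2^2)</math>, and <math>h(x) = x^2</math>, so the exact answer is 1. The function names are hypothetical.<br />

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def self_normalized_is(h, f_unnorm, g_pdf, g_sample, n, rng):
    """Estimate E_f[h] as sum_i b'(x_i) h(x_i), with b' = b / sum(b) and b = f_unnorm / g."""
    xs = [g_sample(rng) for _ in range(n)]
    b = [f_unnorm(x) / g_pdf(x) for x in xs]
    total_b = sum(b)  # the unknown normalizing constant of f cancels in b / total_b
    return sum(bi / total_b * h(x) for bi, x in zip(b, xs))

rng = random.Random(1)
estimate = self_normalized_is(
    h=lambda x: x * x,
    f_unnorm=lambda x: math.exp(-0.5 * x * x),  # N(0,1) kernel, constant omitted
    g_pdf=lambda x: normal_pdf(x, 0.0, 2.0),    # thicker-tailed proposal
    g_sample=lambda r: r.gauss(0.0, 2.0),
    n=20000,
    rng=rng,
)
print(estimate)  # should be close to 1
```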
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and compute the fraction of observations that are greater than 3.<br />
<br />
The variability of plain Monte Carlo will be high in this case, since this is a low probability event (a tail event), and different runs may give significantly different values. For example, with 1000 samples the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we repeat the estimate 100 times, with 100 samples in each run):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x) \sim~ N(4,1) </math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
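<br />
The weight above follows from the ratio of the two normal densities; as a short check:<br />
::<math>b(x) = \frac{\frac{1}{\sqrt{2\pi}}e^{-x^2/2}}{\frac{1}{\sqrt{2\pi}}e^{-(x-4)^2/2}} = \exp\left(\frac{(x-4)^2 - x^2}{2}\right) = \exp\left(\frac{16-8x}{2}\right) = e^{8-4x}</math><br />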
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and much less than the variability observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d. random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
That is,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. Since row <math>i</math> holds the probabilities of leaving state <math>i</math>, the transition matrix that expresses this process is:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
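<br />
A small Python sketch of this transition matrix for a finite number of states is given below. It follows the convention that row <math>i</math> holds the probabilities of leaving state <math>i</math>, with <math>p</math> the probability of a step to the right and <math>q = 1-p</math> a step to the left; the function name <code>random_walk_matrix</code>, the number of states and the value of <math>p</math> are assumptions chosen for illustration.<br />

```python
def random_walk_matrix(num_states, p):
    """Transition matrix of a random walk with absorbing first and last states."""
    q = 1.0 - p
    P = [[0.0] * num_states for _ in range(num_states)]
    P[0][0] = 1.0      # once the first state is reached, we stop
    P[-1][-1] = 1.0    # once the last state is reached, we stop
    for i in range(1, num_states - 1):
        P[i][i - 1] = q    # step to the left
        P[i][i + 1] = p    # step to the right
    return P

P = random_walk_matrix(5, 0.3)
for row in P:
    print(row)
```

Each row is a pmf over the next state, so every row sums to 1.<br />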
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> can be estimated by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row sum to 1 (summing over all <math>j</math> for a fixed <math>i</math>), as each row must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
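<br />
The identity <math>\mu_n=\mu_0P^n</math> can be checked numerically with a short Python sketch. The two-state chain and starting distribution below are hypothetical values chosen for illustration, and the helper names are assumptions; the sketch compares stepping the marginal one transition at a time against multiplying by the matrix power.<br />

```python
def vec_mat(mu, P):
    """Row vector times matrix: (mu P)_j = sum_i mu_i P_ij."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]

def mat_mat(A, B):
    """Matrix product, computed one row of A at a time."""
    return [vec_mat(row, B) for row in A]

def mat_power(P, n):
    """n-step transition matrix P(n) = P^n by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mat(result, P)
    return result

P = [[0.9, 0.1],
     [0.5, 0.5]]       # hypothetical homogeneous chain
mu0 = [1.0, 0.0]       # start in the first state with probability 1
n = 50

mu_step = mu0
for _ in range(n):     # mu_k = mu_{k-1} P, applied n times
    mu_step = vec_mat(mu_step, P)

mu_power = vec_mat(mu0, mat_power(P, n))   # mu_0 P^n
print(mu_step, mu_power)
```

For this particular chain both vectors approach <math>(5/6, 1/6)</math>, the distribution that satisfies <math>\mu P = \mu</math>.<br />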
<br />
====Summary of definitions====<br />
*'''transition matrix''': an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''': also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''': <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
----<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain (or a set of states) is '''Irreducible''' if all of its states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that the state space of a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible, and it reduces to a single class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that rows 1 and 2 have nonzero entries only in columns 1 and 2, so from states 1 and 2 the chain can only move between states 1 and 2<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state, but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state, since it is a closed class with only one element<br />
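<br />
The decomposition above can be reproduced with a short Python sketch. It is one possible approach rather than the lecture's method: reachability is computed by a transitive closure over the nonzero entries of <math>P</math>, and states that reach each other are grouped together. The function names are assumptions.<br />

```python
def communicating_classes(P):
    """Group states into classes of mutually reachable states."""
    n = len(P)
    # reach[i][j] is True if j can be reached from i in zero or more steps.
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):                  # transitive closure (Warshall's algorithm)
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
        seen.update(cls)
        classes.append(cls)
    return classes

def is_closed(cls, P):
    """A class is closed if no transition leaves it."""
    return all(P[i][j] == 0 for i in cls for j in range(len(P)) if j not in cls)

P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]
classes = communicating_classes(P)
labelled = [[s + 1 for s in cls] for cls in classes]   # states numbered from 1
closed = [is_closed(cls, P) for cls in classes]
print(labelled, closed)   # [[1, 2], [3], [4]] [True, False, True]
```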
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called ''recurrent'' or ''persistent'' if<br />
:<math>\displaystyle \Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability of coming back to state i, starting from state i, is 1.<br />
<br />
If this is not the case (i.e. the probability is less than 1), then state i is ''transient''.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the other will be transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. The chain can be back at <math>\displaystyle 0</math> only after an even number of steps 2n, made up of n steps to the right and n steps to the left in any order, so<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> (that is, <math>p\neq \frac{1}{2}</math>) the terms decay geometrically, the series converges and the state is transient. Otherwise <math>\displaystyle 4pq = 1</math> (that is, <math>p = \frac{1}{2}</math>), the terms decay only like <math>\frac{1}{\sqrt{n\pi}}</math>, the series diverges and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} \quad \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
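<br />
The two regimes can also be checked numerically. The following is a minimal sketch in Python (used here instead of MATLAB only so that the snippet is self-contained); the function names <code>partial_return_sum</code> and <code>closed_form</code> are illustrative and not part of the lecture. It accumulates the partial sums of <math>P_{00}(2n)</math> and compares them with <math>\frac{1}{\sqrt{1-4pq}}</math>.<br />

```python
import math


def partial_return_sum(p, terms):
    """Sum of P_00(2n) = C(2n, n) p^n q^n for n = 0, ..., terms - 1."""
    q = 1.0 - p
    term = 1.0   # n = 0 term: C(0, 0) p^0 q^0 = 1
    total = 0.0
    for n in range(terms):
        total += term
        # C(2n+2, n+1) / C(2n, n) = 2(2n+1)/(n+1), so the next term follows
        # from the current one without ever forming a huge binomial coefficient.
        term *= 2.0 * (2 * n + 1) / (n + 1) * p * q
    return total


def closed_form(p):
    """1 / sqrt(1 - 4pq); finite only when p != 1/2."""
    return 1.0 / math.sqrt(1.0 - 4.0 * p * (1.0 - p))


if __name__ == "__main__":
    # Transient case: the partial sums settle at the closed-form value.
    print(partial_return_sum(0.6, 2000), closed_form(0.6))
    # Recurrent case: the partial sums keep growing as more terms are added.
    for terms in (100, 10_000):
        print(terms, partial_return_sum(0.5, terms))
```

For <math>p = 0.6</math> the partial sums stabilise, while for <math>p = \frac{1}{2}</math> they grow roughly like <math>\sqrt{n}</math>, in line with the conclusion above.<br />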
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time needed to go from state i to state j. It is also possible that state j is not reachable from state i, in which case the recurrence time is infinite.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n\geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists (given } x_{0}=i \text{)} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> of state i is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} n f_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j\mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle n</math> if, starting from a state, we will return to it every <math>\displaystyle n</math> steps with probability <math>\displaystyle 1</math>.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: starting from state 1, the chain can return to state 1 only after an even number of steps (2, 4, 6, ...) <math>\Rightarrow</math> period <math>d=2</math><br />
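<br />
The periods of the two example chains can be checked with a short computation. The sketch below is in Python (rather than MATLAB) and assumes the standard definition of the period of a state as the greatest common divisor of all n with <math>P_{ii}(n)>0</math>, which is more general than the informal description above; the helper names <code>mat_mul</code> and <code>period</code> are hypothetical.<br />

```python
from math import gcd


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def period(P, state, max_steps=50):
    """gcd of all n <= max_steps with P^n[state][state] > 0 (0 if none is seen)."""
    d = 0
    power = [row[:] for row in P]      # holds P^n, starting with n = 1
    for n in range(1, max_steps + 1):
        if power[state][state] > 0:
            d = gcd(d, n)              # gcd(0, n) == n, so the first return sets d
        power = mat_mul(power, P)
    return d


# The two chains from the examples above.
cycle3 = [[0, 1, 0],
          [0, 0, 1],
          [1, 0, 0]]
bipartite4 = [[0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0],
              [0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0]]

if __name__ == "__main__":
    print(period(cycle3, 0), period(bipartite4, 0))
```

Only finitely many powers are inspected (<code>max_steps</code>), so this is a numerical check rather than a proof.<br />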
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math> for all i and j.<br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required, where the last equality holds because each row of P sums to 1.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but <math>P^n</math> does not converge, because the chain is periodic with period 3.<br />
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi</math> = [0.5 0 0.5], then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\rightarrow\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
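<br />
The result can be cross-checked by repeatedly applying <math>\pi \leftarrow \pi P</math> from an arbitrary starting distribution. This is a small Python sketch (not the MATLAB used elsewhere in these notes); the names <code>step</code> and <code>stationary_by_iteration</code> are illustrative only.<br />

```python
def step(pi, P):
    """One application of pi <- pi P for a row vector pi."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary_by_iteration(P, start, iters=200):
    """Apply pi <- pi P a fixed number of times, starting from `start`."""
    pi = list(start)
    for _ in range(iters):
        pi = step(pi, P)
    return pi


# Transition matrix from the example above.
P = [[1/2, 1/2, 0],
     [1/2, 1/4, 1/4],
     [0, 1/3, 2/3]]

if __name__ == "__main__":
    # Two different starting points should lead to the same limit.
    print(stationary_by_iteration(P, [1.0, 0.0, 0.0]))
    print(stationary_by_iteration(P, [0.0, 0.0, 1.0]))
    print([4/11, 4/11, 3/11])
```

Both starting points give the same limit, which illustrates that for this ergodic chain the limiting distribution does not depend on the initial state.<br />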
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>, the starting point of the chain (chosen randomly), and set index <math>\displaystyle i=0</math><br />
:<br\>2. <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. The starting point <math>X_0</math> of the algorithm is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of Metropolis-Hastings.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The probability of setting the mean to x and sampling y is equal to the probability of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> tends to land far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is then highly unlikely that <math>\displaystyle u<r</math>: the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution within the run.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b avoids the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
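<br />
The effect of b can also be summarised by the fraction of accepted proposals. The following is a Python sketch of the same Cauchy sampler as in the MATLAB code above (Python is used only to keep the snippet self-contained; the function name <code>metropolis_cauchy</code> and the fixed seed are assumptions made for reproducibility).<br />

```python
import random


def metropolis_cauchy(b, n_steps, seed=0):
    """Random-walk Metropolis chain for the Cauchy target with N(x, b^2) proposals.

    Returns the chain and the fraction of proposals that were accepted.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)                  # random starting point, as in the notes
    chain = [x]
    accepted = 0
    for _ in range(n_steps - 1):
        y = x + b * rng.gauss(0.0, 1.0)      # proposal Y ~ N(x, b^2)
        r = min((1.0 + x * x) / (1.0 + y * y), 1.0)
        if rng.random() < r:
            x = y                            # accept Y
            accepted += 1
        chain.append(x)                      # on rejection the old state is repeated
    return chain, accepted / (n_steps - 1)


if __name__ == "__main__":
    for b in (0.001, 2.0, 1000.0):
        _, rate = metropolis_cauchy(b, 10_000)
        print(b, rate)
```

A very small b gives an acceptance rate close to 1 (tiny moves are almost always accepted, but the chain barely explores), while a very large b gives a rate close to 0 (the chain gets stuck). A moderate b lies in between.<br />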
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We have talked about <math>\emph{discrete}</math> Markov chains so far. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution, <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
That is, we want to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> (by detailed balance)<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where it equals 1, in which both acceptance probabilities are 1 and detailed balance is immediate)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is the probability of jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math>, which happens if the proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is the probability of jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math>, which happens if the proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math> from state <math>\displaystyle x</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math> from state <math>\displaystyle y</math><br />
:<math>\displaystyle r(x,y)</math> is the probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> is the probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
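<br />
The same argument can be verified numerically on a small discrete state space, where the Metropolis-Hastings transition matrix can be written out explicitly. The sketch below is in Python (not MATLAB); the target <code>f</code>, the non-symmetric proposal matrix <code>q</code> and all function names are made-up illustrations, not taken from the lecture.<br />

```python
def mh_transition_matrix(f, q):
    """Transition matrix of Metropolis-Hastings for target f and proposal matrix q."""
    n = len(f)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y and q[x][y] > 0:
                r = min(1.0, (f[y] * q[y][x]) / (f[x] * q[x][y]))
                P[x][y] = q[x][y] * r        # propose y, then accept it
        P[x][x] = 1.0 - sum(P[x])            # remaining mass: chain stays at x
    return P


def detailed_balance_gap(f, P):
    """Largest violation of f(x) P(x, y) = f(y) P(y, x) over all pairs."""
    n = len(f)
    return max(abs(f[x] * P[x][y] - f[y] * P[y][x])
               for x in range(n) for y in range(n))


def stationarity_gap(f, P):
    """Largest violation of f = f P over all coordinates."""
    n = len(f)
    return max(abs(sum(f[x] * P[x][y] for x in range(n)) - f[y])
               for y in range(n))


# Hypothetical target and (non-symmetric) proposal on four states.
f = [0.1, 0.2, 0.3, 0.4]
q = [[0.10, 0.50, 0.20, 0.20],
     [0.30, 0.10, 0.40, 0.20],
     [0.25, 0.25, 0.25, 0.25],
     [0.60, 0.10, 0.20, 0.10]]
P = mh_transition_matrix(f, q)

if __name__ == "__main__":
    print(detailed_balance_gap(f, P), stationarity_gap(f, P))
```

Both gaps are zero up to floating-point rounding, so f satisfies detailed balance and is stationary for P even though q itself is not symmetric.<br />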
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is more efficient, although it is applicable in fewer situations.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. (It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant: the form of the distribution is known, but the normalizing constant is not.)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
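<br />
The dependence is easy to see in a simulation: whenever a proposal is rejected, the chain repeats its previous value. The sketch below is in Python (not MATLAB) and uses a hypothetical setup chosen only for illustration: an unnormalized standard normal target <math>f(x) \propto e^{-x^2/2}</math> and a fixed proposal <math>q = N(0, 2^2)</math>. The function name <code>independence_mh</code> is not from the lecture.<br />

```python
import math
import random


def independence_mh(n_steps, proposal_sd=2.0, seed=0):
    """Independence Metropolis-Hastings chain for an N(0, 1) target."""
    rng = random.Random(seed)

    def log_f(x):
        return -0.5 * x * x                      # log target, up to a constant

    def log_q(x):
        return -0.5 * (x / proposal_sd) ** 2     # log proposal, up to a constant

    x = 0.0
    chain = []
    for _ in range(n_steps):
        y = rng.gauss(0.0, proposal_sd)          # drawn without looking at x
        # log of f(y)/f(x) * q(x)/q(y); the unknown constants cancel
        log_r = (log_f(y) - log_f(x)) + (log_q(x) - log_q(y))
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x = y                                # accept with probability min(r, 1)
        chain.append(x)
    return chain


if __name__ == "__main__":
    chain = independence_mh(20_000)
    repeats = sum(1 for a, b in zip(chain, chain[1:]) if a == b)
    print(repeats, sum(chain) / len(chain))
```

The count of repeated consecutive values is positive, which would essentially never happen for independent draws from a continuous distribution, yet the sample mean and variance still match the target.<br />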
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But this is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant T > 0 (since the exponential function is monotone increasing, and the minus sign turns the minimum into a maximum)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, but since T is large the exponent is close to 0 and r is close to 1, so we still accept <math>\displaystyle y </math> with high probability (the chain can move uphill and explore)<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore almost always reject <math>\displaystyle y </math> (the chain only moves downhill)<br />
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for large T and contracts for small T, i.e. T plays the role of a variance, making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, when T is very small, the distribution that we are sampling from becomes sharper, and the points that we sample are close to the local maximum of the exponential function (which is the mode of the distribution), thereby corresponding to the local minimum of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
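% objective function (saved in its own file, obj.m)<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
% main script<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); <br />
% same ratio as exp(-obj(Y)/T)/exp(-obj(X(i-1))/T), written as one exponential to avoid 0/0 when T is very small<br />
r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) );<br />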
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
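<br />
The whole cooling schedule can be run in a single loop. The following is a Python sketch of the same procedure (not the MATLAB script above); the function name <code>simulated_annealing</code>, the choice of 100 steps per temperature and the fixed seed are assumptions made for illustration.<br />

```python
import math
import random


def h(x):
    """Objective to minimize (the function called f in Example 2)."""
    return (x - 2.0) ** 2


def simulated_annealing(temperatures, steps_per_temp=100, b=2.0, x0=0.0, seed=0):
    """Run Metropolis steps at each temperature in turn; return every visited state."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for T in temperatures:
        for _ in range(steps_per_temp):
            y = x + b * rng.gauss(0.0, 1.0)      # proposal Y ~ N(x, b^2)
            log_r = (h(x) - h(y)) / T            # log of f(y)/f(x) with f ∝ exp(-h/T)
            if log_r >= 0 or rng.random() < math.exp(log_r):
                x = y                            # accept with probability min(r, 1)
            path.append(x)
    return path


if __name__ == "__main__":
    schedule = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
    path = simulated_annealing(schedule)
    print(path[-1])
```

Working with the logarithm of the ratio avoids dividing two exponentials that both underflow to zero when T is very small.<br />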
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, as Metropolis-Hastings does:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the Metropolis-Hastings proof.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively strong correlations between the random variables, because each conditional update can then only move a short distance.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; the initial period before convergence is called the "burn-in" time. For this reason the distribution is sampled better by using only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. Here, <math>\displaystyle X_2|X_1 = x_1 \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle X_1|X_2 = x_2 \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
 Y = [1 ; 2];               % mean vector (mu_1, mu_2)<br />
 rho = 0.01;                % correlation between X_1 and X_2<br />
 sigma = sqrt(1 - rho^2);   % conditional standard deviation<br />
 X = zeros(5000, 2);        % preallocate; X(1,:) = [0 0] is the initial point<br />
 <br />
 for i = 2:5000<br />
     % draw X_1 given the previous value of X_2<br />
     mu = Y(1) + rho*(X(i-1,2) - Y(2));<br />
     X(i,1) = mu + sigma*randn;<br />
     % draw X_2 given the NEW value of X_1, i.e. X(i,1) rather than X(i-1,1)<br />
     mu = Y(2) + rho*(X(i,1) - Y(1));<br />
     X(i,2) = mu + sigma*randn;<br />
 end<br />
 % plot(X(:,1),X(:,2),'.') plots all of the points<br />
 % plot(X(1000:end,1),X(1000:end,2),'.') plots roughly the last 4000 points,<br />
 % i.e. it discards the initial burn-in period before convergence<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
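<br />
The procedure above can be sketched in Python as follows. This is a minimal illustration rather than code from the lecture: the target log_f (an unnormalized bivariate normal with zero means, unit variances and correlation rho), the Gaussian random-walk proposals, and all function and parameter names are assumptions made for the example. Because a random-walk proposal is symmetric, the proposal ratio in r equals 1 and drops out.<br />

```python
import math
import random


def log_f(x, y, rho):
    """Log of an unnormalized bivariate normal density (zero means, unit variances)."""
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))


def mh_step(current, log_target, step, rng):
    """One Metropolis-Hastings update of a single coordinate.

    The proposal is a symmetric Gaussian random walk, so the proposal
    ratio q(current | z) / q(z | current) is 1 and r is a ratio of f only.
    """
    z = rng.gauss(current, step)
    r = math.exp(min(0.0, log_target(z) - log_target(current)))
    return z if rng.random() < r else current


def mh_within_gibbs(n_samples, rho=0.5, step=1.0, seed=0):
    """Alternate MH updates of X (with Y fixed) and Y (with X fixed)."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0  # arbitrary starting point (X_0, Y_0)
    samples = []
    for _ in range(n_samples):
        x = mh_step(x, lambda z: log_f(z, y, rho), step, rng)  # uses Y_n
        y = mh_step(y, lambda z: log_f(x, z, rho), step, rng)  # uses X_{n+1}
        samples.append((x, y))
    return samples
```

As with plain Gibbs sampling, the early part of the chain would normally be discarded before the samples are used, for example mh_within_gibbs(20000)[2000:].<br />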
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
PageRank is a link analysis algorithm named after Larry Page, a computer scientist and one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However, the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we are free to fix its overall scale. Summing the rank equation over all pages (and assuming every page has at least one outgoing link) shows that <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). We can use this to solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times \frac{e^T \times P}{N} + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [\frac{(1-d)}{N} \times e \times e^T + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
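<br />
The iteration can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the lecture's code: it iterates the matrix form <math>P = (1-d) e + d L D^{-1} P</math> directly, it assumes every page has at least one outgoing link, and the function name, the default d = 0.85, and the three-page example graph in the note below are all made up for the example. Iterating this form yields ranks that sum to N (average rank 1); divide by N if a probability vector is wanted.<br />

```python
def pagerank(L, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate P <- (1 - d) * e + d * L * D^{-1} * P until it stops changing.

    L[i][j] is 1 if page j links to page i and 0 otherwise.
    """
    n = len(L)
    # c[j] = number of outgoing links from page j (column sum of L)
    c = [sum(L[i][j] for i in range(n)) for j in range(n)]
    if any(cj == 0 for cj in c):
        raise ValueError("every page needs at least one outgoing link")
    p = [1.0] * n  # initial guess P_0
    for _ in range(max_iter):
        new_p = [
            (1.0 - d) + d * sum(L[i][j] * p[j] / c[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p
```

For example, with three pages where page 0 links to pages 1 and 2, page 1 links to page 2, and page 2 links to page 0, the link matrix is L = [[0, 0, 1], [1, 0, 0], [1, 1, 0]], and pagerank(L) gives page 2 the highest rank because it is linked to by both other pages.<br />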
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
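<br />
The four definitions above translate directly into code. The following Python sketch is only an illustration; the function names are chosen for this example, and vectors are represented as plain lists of numbers.<br />

```python
import math


def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(ui * vi for ui, vi in zip(u, v))


def norm(v):
    """Length of v: the square root of v . v."""
    return math.sqrt(dot(v, v))


def dist(u, v):
    """Distance between u and v: the length of u - v."""
    return norm([ui - vi for ui, vi in zip(u, v)])


def angle(u, v):
    """Angle theta (in radians) with u . v = ||u|| ||v|| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    # clamp to [-1, 1] to guard against floating-point round-off
    return math.acos(max(-1.0, min(1.0, cos_theta)))
```

For instance, dot([1, 2, 3], [4, 5, 6]) is 1*4 + 2*5 + 3*6 = 32, and norm([3, 4]) is 5.<br />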
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthonormal matrix is more commonly called an orthogonal matrix; the two terms refer to the same object.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be computed as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the maximum number of linearly independent rows (or, equivalently, columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math> with <math>A^{\dagger}A = I</math>.<br />
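<br />
As a small worked example of the pseudo-inverse formula, the Python sketch below computes <math>(A^TA)^{-1}A^T</math> for a matrix with two linearly independent columns. It is only an illustration: the helper names are invented for this example, matrices are lists of rows, and the code handles only the two-column case so that a closed-form 2 by 2 inverse can be used.<br />

```python
def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Matrix product A * B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse_2x2(M):
    """Inverse of a 2-by-2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def pseudo_inverse(A):
    """(A^T A)^{-1} A^T for an m-by-2 matrix A with independent columns."""
    At = transpose(A)
    return matmul(inverse_2x2(matmul(At, A)), At)
```

For the 3 by 2 matrix A = [[1, 0], [0, 1], [1, 1]], we have <math>A^TA = \left[\begin{matrix} 2 & 1 \\ 1 & 2 \end{matrix}\right]</math>, and multiplying pseudo_inverse(A) by A gives the 2 by 2 identity matrix, as stated above.<br />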
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
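<br />
For a 2 by 2 symmetric matrix the characteristic equation is a quadratic, <math>\lambda^2 - tr(A)\lambda + |A| = 0</math>, so the eigenvalues can be written down explicitly. The Python sketch below does this; it is an illustration only, the function name is invented, and it assumes the off-diagonal entry b is non-zero so that the eigenvector formula is valid.<br />

```python
import math


def eigen_symmetric_2x2(a, b, c):
    """Eigenvalues and unit eigenvectors of [[a, b], [b, c]], assuming b != 0.

    Returns [(lambda_1, v_1), (lambda_2, v_2)] with lambda_1 >= lambda_2.
    """
    trace = a + c
    det = a * c - b * b
    # The discriminant equals (a - c)^2 + 4 b^2, so it is never negative:
    # a real symmetric matrix always has real eigenvalues.
    disc = math.sqrt(trace * trace - 4.0 * det)
    pairs = []
    for lam in ((trace + disc) / 2.0, (trace - disc) / 2.0):
        vx, vy = b, lam - a  # solves (A - lam I) v = 0 when b != 0
        length = math.hypot(vx, vy)
        pairs.append((lam, (vx / length, vy / length)))
    return pairs
```

For the matrix with rows (2, 1) and (1, 2) this gives eigenvalues 3 and 1, which are real, positive and sum to the trace 4, with eigenvectors along (1, 1) and (1, -1), which are orthogonal.<br />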
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
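<br />
A rotation of the plane is a simple example of an orthonormal transformation. The Python sketch below builds a 2 by 2 rotation matrix and applies it to a vector, so that the length-preserving property can be checked numerically. The function names and the angle used in the note are arbitrary choices for this illustration.<br />

```python
import math


def rotation(theta):
    """2-by-2 rotation matrix for an angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def apply(A, x):
    """Matrix-vector product y = A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm(x):
    """Euclidean length of x."""
    return math.sqrt(sum(xi * xi for xi in x))
```

For example, with A = rotation(0.7) and x = [3, 4], the vector y = apply(A, x) has the same length as x (namely 5, up to floating-point round-off), and the two rows of A have unit length and zero inner product.<br />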
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances along the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d'' - dimensional vectors and produces an orthogonal(zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k'' - dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br /><br /><br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<br /><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br /><br /><br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br /><br /><br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> determined by the weight vector <math>\displaystyle w </math>. The first principal component is the vector that maximizes the variance,<br />
<br /><br /><br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br /><br /><br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint.<br /> The problem then becomes,<br />
<br /><br /><br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method on this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
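<br />
The result of this example can be checked numerically without Lagrange multipliers, by parametrizing the constraint circle as <math>(\cos t, \sin t)</math> and searching over t. The Python sketch below does this with a simple grid search; it is only a sanity check, and the function name and grid size are arbitrary choices made for this illustration.<br />

```python
import math


def argmax_on_unit_circle(f, n_grid=100000):
    """Return the grid point (x, y) on x^2 + y^2 = 1 with the largest f(x, y)."""
    best_t = max(
        (2.0 * math.pi * k / n_grid for k in range(n_grid)),
        key=lambda t: f(math.cos(t), math.sin(t)),
    )
    return math.cos(best_t), math.sin(best_t)
```

Calling argmax_on_unit_circle(lambda x, y: x - y) returns a point very close to <math>(\sqrt{2}/2,-\sqrt{2}/2)</math>, in agreement with the stationary point found above.<br />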
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (i.e. <math> \textbf{w}^T \textbf{w} = 1 </math>) then the second part of the equation is 0, so on the constraint set the Lagrangian equals the variance we want to maximize. <br />
<br />
The constraint itself is recovered by setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero, which gives <math> \textbf{w}^T \textbf{w} =1 </math><br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
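<br />
The derivation can be illustrated with a short Python sketch that computes the sample covariance matrix S of a small data set and finds its leading eigenvector. This is an illustration under stated assumptions, not the method used in class: power iteration is swapped in as a simple way to find the eigenvector with the largest eigenvalue, the function names are invented, and the ten two-dimensional data points in the note below are made up for the example.<br />

```python
import math


def sample_covariance(data):
    """Sample covariance matrix S (divisor n - 1) of a list of data rows."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    return [
        [sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def projected_variance(S, w):
    """Variance w^T S w of the data projected onto the unit vector w."""
    d = len(S)
    return sum(w[i] * S[i][j] * w[j] for i in range(d) for j in range(d))


def first_principal_component(S, iters=200):
    """Leading eigenvector of S by power iteration, with its eigenvalue.

    Assumes the starting vector (all ones) is not orthogonal to the
    leading eigenvector and that the largest eigenvalue is not repeated.
    """
    d = len(S)
    w = [1.0] * d
    for _ in range(iters):
        w = [sum(S[i][j] * w[j] for j in range(d)) for i in range(d)]
        length = math.sqrt(sum(x * x for x in w))
        w = [x / length for x in w]
    return w, projected_variance(S, w)
```

For example, take data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]. With S = sample_covariance(data) and w, lam = first_principal_component(S), the vector w satisfies <math>S\textbf{w} = \lambda\textbf{w}</math>, and lam plus the variance along the direction orthogonal to w equals the trace of S, which is the decomposition of total variance stated above.<br />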
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noise data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
 who                                % list the variables that were loaded<br />
 size(X)                            % each column of X is one image of 20*28 = 560 pixels<br />
 imagesc(reshape(X(:,1),20,28)')    % display the first noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);                  % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<br />
<gallery><br />
Image:face1.jpg|"Noisy Face"<br />
Image:face2.jpg|"De-noised Face"<br />
</gallery><br />
<br />
<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One application of PCA is grouping similar data (e.g. images). There are generally two ways to do this: we can classify the data (i.e. the training data are labeled, and new data are assigned to one of the known classes) or cluster the data (i.e. the data are not labeled, and we look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8x8 picture, we can use all 64 pixels). However, this is hard; it is easier to work with a reduced set of features extracted from the data.<br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s from the same data set we have been using throughout the course. We will apply PCA to the data and compare the images of the labeled data. This is an example of classification.<br />
<br />
The MATLAB code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs , scores ] = princomp (X') <br />
% performs principal components analysis on the data matrix X<br />
% returns the principal component coefficients and scores<br />
% scores is the low-dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
[[File:Plot1.jpg]]<br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
[[File:Plot2.jpg]]<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
% this adds the data for the 3s<br />
% the 'ro' command makes them red Os on the plot<br />
[[File:Plot3.jpg]]<br />
% If we classify based on the position in this plot (feature),<br />
% it is easier than looking at each of the 64 pixel values<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
[[File:Plot4.jpg]]<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
<br />
[[Image:PCAalgorithm.JPG|600px]]<br />
<br />
<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see details in [http://en.wikipedia.org/wiki/Principle_component_analysis PCA in Wikipedia]).<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using the mapping <math>\, Y = U^T X_{D \times n} </math>, where <math>U</math> is the <math>D \times d</math> matrix of the top d eigenvectors.<br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using the mapping <math>\, \hat{X} = UY_{d \times n} </math>.<br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
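<br />
The encoding and reconstruction steps above can be sketched in a few lines. This is only an illustrative sketch, not the code used in class: it is written in Python rather than MATLAB, the ten data points are made up, and it handles only the special case D = 2, d = 1, where the eigenvalues of the 2x2 covariance matrix are available in closed form.<br />

```python
import math

# Made-up 2-D data (D = 2), one (x, y) pair per observation.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(points)

# Step 1: centre the data so that its mean is 0.
mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]

# Step 2: sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
sxx = sum(c[0] * c[0] for c in centered) / (n - 1)
syy = sum(c[1] * c[1] for c in centered) / (n - 1)
sxy = sum(c[0] * c[1] for c in centered) / (n - 1)

# Step 3: eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (sxx + syy) / 2
root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = half_trace + root, half_trace - root

# Unit eigenvector for the largest eigenvalue (valid because sxy != 0 here).
norm = math.hypot(sxy, lam1 - sxx)
u1 = (sxy / norm, (lam1 - sxx) / norm)

# Encoding (d = 1): y_i = u1^T x_i for each centred point.
encoded = [u1[0] * c[0] + u1[1] * c[1] for c in centered]

# Reconstruction: x_hat_i = u1 * y_i, then add the mean back.
reconstructed = [(mean[0] + y * u1[0], mean[1] + y * u1[1]) for y in encoded]

var_encoded = sum(y * y for y in encoded) / (n - 1)
residual = sum((p[0] - r[0]) ** 2 + (p[1] - r[1]) ** 2
               for p, r in zip(points, reconstructed)) / (n - 1)
print(lam1 + lam2, sxx + syy)  # total variance, computed two ways
print(var_encoded, lam1)       # variance kept by the first component
print(residual, lam2)          # variance lost in the reconstruction
```

Each printed pair should agree up to rounding: the eigenvalues sum to the trace of S, the encoded data have variance equal to the largest eigenvalue, and the reconstruction error equals the discarded eigenvalue.<br />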
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses).<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-class variance. That is, our ideal situation is that the individual classes are as far away from each other as possible, while the data within each class are as close together as possible (ideally collapsing to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2</math> are the covariance matrices for the 1st and 2nd class of data respectively)<br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> be the within-class covariance.<br />
Then, this problem can be rewritten as <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br /><br />
The optimization problem we want to solve is,<br />
<br /><br /> <br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2, </math><br />
<br /><br /><br />
<math>\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2)\\<br />
&= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}</math><br />
<br /><br /><br />
which is a scalar, and is therefore equal to its own trace:<br />
<br /><br /><br />
<math>\displaystyle (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br /><br /><br />
(using the cyclic property of the trace, <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br /><br /><br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br /><br /><br />
Thus our original problem can equivalently be written as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br /><br /><br />
For a two class problem the between class variance is,<br />
<br /><br /><br />
<math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math><br />
<br /><br /> <br />
Then this problem can be rewritten as,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br /><br />
# <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
# <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br /><br /><br />
We take the ratio of the two; that is, we wish to maximize<br /><br />
<br /><br />
<math>\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\Sigma_1 + \Sigma_2)w)} </math><br />
<br />
or equivalently,<br /><br /><br />
<br />
<math>\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}</math><br />
<br />
<br />
<br />
This is a well-known problem called the "generalized eigenvalue problem". We can solve it using Lagrange multipliers. Since w is a direction vector, we do not care about its magnitude. Therefore we solve a constrained problem similar to that in PCA,<br />
<br /><br /><br />
<math>\displaystyle \max (w^Ts_Bw)</math> <br /><br />
subject to <math>\displaystyle w^Ts_ww=1</math> <br />
<br /><br /><br />
We solve the following Lagrange multiplier problem,<br />
<br /><br /><br />
<math>\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_ww -1) </math><br /><br /><br />
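<br />
The step from this Lagrangian to an eigenvalue problem can be written out as follows. This is a short sketch of the standard argument, under the assumption (not stated in the lecture) that the within-class matrix is invertible.<br />

```latex
\begin{align}
\frac{\partial L}{\partial w} &= 2 s_B w - 2\lambda s_w w = 0 \\
s_B w &= \lambda s_w w \\
s_w^{-1} s_B w &= \lambda w
\end{align}
```

So w must be an eigenvector of <math>s_w^{-1}s_B</math>. Multiplying <math>s_B w = \lambda s_w w</math> on the left by <math>w^T</math> and using the constraint gives <math>w^Ts_Bw = \lambda</math>, so the objective is largest for the eigenvector with the largest eigenvalue.<br />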
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
As discussed in the previous lecture, our optimization problem for FDA is:<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />Subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
Using Lagrange multipliers, we set to zero the partial derivatives of: <br />
<math>\displaystyle (w^Ts_Bw) - \lambda [(w^Ts_ww)-1] </math><br />
<br />
- The optimal solution for w is the eigenvector of <br />
<math>\displaystyle s_w^{-1}s_B </math> <br />
corresponding to the largest eigenvalue;<br />
<br />
- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues. <br />
<br />
- In the case of a two-class problem, the optimal solution for w can be simplified, such that:<br />
<math>\displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1) </math> <br />
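<br />
As a quick numerical check of the two-class formula, the sketch below computes the direction for the means and covariance used in the example that follows, and compares the Fisher ratio of that direction with a few other directions. It is written in Python purely for illustration; the helper names (<code>inverse_2x2</code>, <code>mat_vec</code>, <code>fisher_ratio</code>) are made up here and are not part of any library.<br />

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def fisher_ratio(w, s_b, s_w):
    """(w^T s_B w) / (w^T s_w w), the quantity FDA maximizes."""
    return dot(w, mat_vec(s_b, w)) / dot(w, mat_vec(s_w, w))

mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]               # common class covariance
s_w = [[2 * e for e in row] for row in sigma]  # Sigma_1 + Sigma_2
diff = [mu2[0] - mu1[0], mu2[1] - mu1[1]]      # mu_2 - mu_1
s_b = [[diff[0] * diff[0], diff[0] * diff[1]],
       [diff[1] * diff[0], diff[1] * diff[1]]]

w = mat_vec(inverse_2x2(s_w), diff)            # w = s_w^{-1}(mu_2 - mu_1)
print(w)
print(fisher_ratio(w, s_b, s_w))               # ratio along the FDA direction
print(fisher_ratio(diff, s_b, s_w))            # ratio along mu_2 - mu_1
print(fisher_ratio([1.0, 0.0], s_b, s_w))      # ratio along the x-axis
```

The FDA direction should give a larger ratio than the line joining the two means, which shows that the within-class covariance matters as well as the distance between the means.<br />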
<br />
=====Example:=====<br />
<br />
We now see an application of the theory that we just introduced. Using MATLAB, we find the principal components and the Fisher Discriminant Analysis projection of two bivariate normal distributions to show the difference between the two methods.<br />
%First of all, we generate the two data sets:<br />
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300)<br />
%In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and Sigma are the mean and covariance matrix.<br />
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300)<br />
%Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]<br />
%The plot of the two distributions is:<br />
<br />
[[File:Mvrnd.jpg]]<br />
<br />
%We compute the principal components:<br />
X=[X1;X2] % stack the two samples vertically: 600 x 2<br />
X=X' % 2 x 600, one data point per column<br />
[coefs, scores]=princomp(X');<br />
coefs(:,1) %first principal component<br />
coefs(:,1)<br />
<br />
ans =<br />
0.76355476446932<br />
0.64574307712603<br />
<br />
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')<br />
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')<br />
sw=2*[1 1.5;1.5 3] % sw=Sigma1+Sigma2=2*Sigma1<br />
w=inv(sw)*[4 2]' % calculate s_w^{-1}(mu2 - mu1)<br />
plot ([0 w(1)], [0 w(2)],'g')<br />
<br />
[[File:Pca_full_1.jpg]]<br />
<br />
%We now make the projection:<br />
Xf=w'*X<br />
figure<br />
plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"<br />
hold on<br />
plot(Xf(301:600),1,'or')<br />
<br />
<br />
[[File:Fisher_no_overlap.jpg]]<br />
<br />
%We see in the above picture that the two classes do not overlap<br />
Xp=coefs(:,1)'*X<br />
figure<br />
plot(Xp(1:300),1,'ob')<br />
hold on<br />
plot(Xp(301:600),2,'or') <br />
<br />
<br />
[[File:Pca_overlap.jpg]]<br />
<br />
%In this case the classes overlap, since we project the data onto the first principal component [Xp=coefs(:,1)'*X]<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance we could be wishing to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image. (In the case in which Y is a continuous variable, classification is an application of regression.)<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The set of data points <math> \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} </math> is called the ''training set.''<br />
<br />
Then <math>\displaystyle h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set, i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we estimate the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points. This gives the empirical error rate:<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
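<br />
The empirical error rate is a one-line computation. The sketch below is illustrative only: the threshold classifier <code>h</code> and the six test points are made up for the example.<br />

```python
def error_rate(h, xs, ys):
    """Fraction of test points (x_i, y_i) for which h(x_i) != y_i."""
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

def h(x):
    """A made-up classifier: predict class 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0

test_x = [0.1, 0.4, 0.6, 0.9, 0.45, 0.7]
test_y = [0, 0, 1, 1, 1, 0]
rate = error_rate(h, test_x, test_y)
print(rate)  # the last two points are misclassified, so the rate is 2/6
```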
<br />
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math>.<br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classifier function is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
This classifier, known as the Bayes classifier, is the best classifier in terms of error rate. In practice, however, we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>. The Bayes classifier is therefore viewed as a theoretical bound: the best that any classification technique can achieve. Practical techniques approximate it, for example by estimating the Decision Boundary.<br />
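<br />
When the class-conditional densities and priors ''are'' known, the rule can be applied directly. The sketch below is a toy illustration under that assumption: the two univariate normal densities, N(0,1) for class 0 and N(2,1) for class 1, are invented for the example, and the function names are hypothetical.<br />

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def posterior_class1(x, prior1, pdf0, pdf1):
    """r(x) = Pr(Y=1 | X=x), computed with Bayes' theorem."""
    numerator = pdf1(x) * prior1
    denominator = numerator + pdf0(x) * (1 - prior1)
    return numerator / denominator

def bayes_classifier(x, prior1, pdf0, pdf1):
    """h(x) = 1 if r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior_class1(x, prior1, pdf0, pdf1) >= 0.5 else 0

def pdf0(x):
    return normal_pdf(x, 0.0, 1.0)  # assumed density of X given Y = 0

def pdf1(x):
    return normal_pdf(x, 2.0, 1.0)  # assumed density of X given Y = 1

for x in (0.5, 1.0, 1.5):
    print(x, posterior_class1(x, 0.5, pdf0, pdf1),
          bayes_classifier(x, 0.5, pdf0, pdf1))
```

With equal priors the two posteriors are equal at x = 1, halfway between the class means; raising the prior of class 1 moves that point towards the mean of class 0.<br />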
<br />
<br />
=== Decision Boundary ===<br />
<br />
[[Image:Decision_boundary_Joanna.jpg|thumb|right|250px]]<br />
<br />
The Decision boundary is given by:<br />
<br />
<math>\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x) </math><br />
<br />
That is, it consists of the points at which the probabilities of belonging to the two classes are identical. Thus,<br />
<br />
<br />
<math>\, D: \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\} </math><br />
<br /><br /><br />
Linear discriminant analysis has a linear decision boundary while quadratic discriminant analysis has a decision boundary represented by a quadratic function.<br />
<br /><br /><br /><br />
<br />
=== Linear Discriminant Analysis(LDA) - July 23===<br />
==== Motivation ====<br />
[[Image:LDAmulti.jpg|thumb|right|250px|"LDA decision boundary for 2 classes of multivariate normal data"]]<br />
We would like to apply the Bayes classification rule by approximating the class conditional density <math>f_k(x)</math> and the prior <math>\pi_k</math> where,<br /><br /><br />
<math>P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{j}f_j(x)\pi_j}</math><br />
<br /><br /><br />
<br />
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,<br />
#The class conditional density is multivariate Gaussian<br />
#The classes have a common covariance matrix<br />
<br />
==== Derivation ====<br />
<br />
:'''Note on Quadratic Form'''<br /><br /><br />
: <math>(x + a)^TA(x+b) = x^TAx + x^TAb + a^TAx + a^TAb</math><br />
<br />
<br />
<br />
By assumption (1),<br /><br /><br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}</math><br /><br /><br />
<br />
where <math>\Sigma_k</math> is the class covariance matrix and <math>\mu_k</math> is the class mean. By definition of the decision boundary,<br /><br /><br />
<math>\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align} </math><br />
<br /><br /><br />
By assumption (2),<br /><br /><br />
<br />
<math>\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T\Sigma^{-1}(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}</math><br />
<br /><br /><br />
The result is a linear function of <math>x</math> of the form <math>a^Tx + b = 0</math>.<br />
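<br />
To make the linear form concrete, the sketch below computes the coefficients for two classes in two dimensions, taking <math>a = \Sigma^{-1}(\mu_k - \mu_l)</math> and <math>b = \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log\frac{\pi_k}{\pi_l}</math>. The means, covariance and priors are assumed values chosen for illustration, and the helper names are hypothetical.<br />

```python
import math

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_boundary(mu_k, mu_l, sigma, prior_k, prior_l):
    """Coefficients (a, b) of the boundary a^T x + b = 0."""
    sigma_inv = inverse_2x2(sigma)
    a = mat_vec(sigma_inv, [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]])
    diff = [mu_l[0] - mu_k[0], mu_l[1] - mu_k[1]]
    total = [mu_l[0] + mu_k[0], mu_l[1] + mu_k[1]]
    b = 0.5 * dot(diff, mat_vec(sigma_inv, total)) + math.log(prior_k / prior_l)
    return a, b

def classify(x, a, b):
    """Class k when a^T x + b > 0, class l otherwise."""
    return "k" if dot(a, x) + b > 0 else "l"

mu_k, mu_l = [1.0, 1.0], [5.0, 3.0]
sigma = [[1.0, 1.5], [1.5, 3.0]]
a, b = lda_boundary(mu_k, mu_l, sigma, 0.5, 0.5)
midpoint = [3.0, 2.0]
print(a, b)
print(dot(a, midpoint) + b)  # equal priors: the midpoint lies on the boundary
print(classify(mu_k, a, b), classify(mu_l, a, b))
```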
<br />
==== Computational Method ====<br />
<br />
We can implement this computationally by the following:<br />
<br />
Define two variables, <math>\, \delta_k </math> and <math>\, \delta_l </math><br />
<br />
<math> \,\delta_k = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|) </math><br />
<br />
<math> \,\delta_l = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|) </math><br />
<br />
To classify a point, x, first compute <math>\, \delta_k </math> and <math>\, \delta_l </math>. <br /><br /><br />
Classify it to class k if <math>\, \delta_k > \delta_l </math> and to class l otherwise. <br /><br /></div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3350stat341 / CM 3612009-07-21T23:08:20Z<p>Hclam: /* 1. Minimize the within class variance */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. Because a computer is deterministic, what it generates are ''pseudo-random'' numbers: a sequence produced by a deterministic rule whose values nevertheless appear to be drawn from a uniform distribution. Outside a computational setting, sampling from a uniform distribution is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term (when ''b'' = 0, as in the example above, the generator is purely multiplicative). For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1 (or between 1 and ''m'' - 1 when ''b'' = 0, since 0 would then repeat forever). Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way; poorly chosen values are not suitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
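<br />
Both examples above can be reproduced with a few lines of code. The sketch below is in Python and is only illustrative; the function name <code>lcg</code> is made up here.<br />

```python
def lcg(a, b, m, seed, n):
    """Return [x_0, x_1, ..., x_n] with x_{i+1} = (a * x_i + b) mod m."""
    xs = [seed]
    for _ in range(n):
        xs.append((a * xs[-1] + b) % m)
    return xs

first = lcg(13, 0, 31, 1, 3)        # a = 13, b = 0, m = 31, x_0 = 1
stuck = lcg(3, 2, 4, 1, 3)          # a = 3, b = 2, m = 4, x_0 = 1
cycle = lcg(13, 0, 31, 1, 30)       # x_0, ..., x_30 of the first generator
uniforms = [x / 30 for x in cycle]  # scale by m - 1 to get values in (0, 1]

print(first)
print(stuck)
print(len(set(cycle[:30])), cycle[30])
```

The first generator visits every integer from 1 to 30 before returning to the seed, while the second never leaves its starting value.<br />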
<br />
Ideally, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, a paper published in 1988 by Park and Miller found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the largest value of a 32-bit signed integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) of some distribution, then the random variable X has cdf F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
<br />
So F is monotonically increasing, since the probability that X is at most a larger number cannot be smaller than the probability that X is at most a smaller number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
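:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as <math>Unif[0, 1]</math>, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>.<br />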
<br />
Now we can generate our random sample <math>i=1\dots n</math> from <math>f(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows an exponential pattern, as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math> in closed form. Further, for some distributions there is not even a closed-form expression for <math>F(x)</math> itself.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (the upper bound of each interval is inclusive, so that the case <math>U = 1</math> is not missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
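<br />
The general step can be implemented by accumulating the probabilities until the running total reaches U. The sketch below is an illustration in Python, not the code from class; the function name is made up, the distribution is the one used in the example that follows, and a boundary value of U is assigned to the lower interval (a choice that has probability zero of mattering).<br />

```python
import random

def inverse_transform_discrete(values, probs, u):
    """Return the value x_k whose cumulative-probability interval contains u."""
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if u <= cumulative:
            return value
    # Floating-point rounding can leave the final sum slightly below 1.
    return values[-1]

values, probs = [0, 1, 2], [0.3, 0.2, 0.5]
rng = random.Random(0)  # fixed seed only so that the run is repeatable
draws = [inverse_transform_discrete(values, probs, rng.random())
         for _ in range(10000)]
freq = {v: draws.count(v) / len(draws) for v in values}
print(freq)  # the relative frequencies should be close to probs
```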
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
for i=1:1000;<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
elseif u <= sum(p(1:2))<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c</math> such that <math>c \cdot g(x) \geq f(x)\ \forall x</math>, we can draw candidates from <math>g(x)</math> and accept each candidate <math>x</math> with probability <math> \frac {f(x)}{c \cdot g(x)} </math>. The accepted candidates form a sample that follows the target distribution <math>f(x)</math>. Candidates are likely to be accepted where the ratio is close to 1 and likely to be rejected where it is close to 0.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (scaled proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, a candidate drawn from <math>g(x)</math> is accepted with probability <math> \frac {f(7)}{c \cdot g(7)} </math>; the closer the two curves are at that point, the more likely the candidate is to be kept.<br />
<br />
At x=9, the same rule applies: the candidate is rejected with probability <math> 1 - \frac {f(9)}{c \cdot g(9)} </math>.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x) </math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
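<br />
The procedure above can be sketched as a generic function. This is an illustrative Python sketch rather than the course's MATLAB code: the function name <code>accept_reject</code> and its arguments are made up, and the target <math>f(x) = 2x</math> on [0, 1] with a uniform proposal and c = 2 is used only as a test case.<br />

```python
import random

def accept_reject(f, g_pdf, g_sample, c, n, rng):
    """Draw n samples from density f using proposals from g (c*g >= f)."""
    samples, trials = [], 0
    while len(samples) < n:
        y = g_sample(rng)   # step 1: candidate Y ~ g
        u = rng.random()    # step 2: U ~ Unif[0, 1]
        trials += 1
        if u <= f(y) / (c * g_pdf(y)):  # step 3: accept or reject
            samples.append(y)
    return samples, trials

rng = random.Random(1)  # fixed seed only so that the run is repeatable
samples, trials = accept_reject(
    f=lambda x: 2.0 * x,            # target density on [0, 1]
    g_pdf=lambda x: 1.0,            # Unif[0, 1] proposal density
    g_sample=lambda r: r.random(),  # draw from the proposal
    c=2.0, n=20000, rng=rng)
sample_mean = sum(samples) / len(samples)
acceptance_rate = len(samples) / trials
print(sample_mean)      # should be close to the target mean 2/3
print(acceptance_rate)  # should be close to 1/c = 0.5
```

The acceptance rate illustrates the result Pr(accept) = 1/c from the proof: a larger c means more rejected candidates per accepted sample.<br />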
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
j = 1;<br />
while j <= 1000 % collect 1000 accepted samples<br />
y = rand; % candidate from g = Unif[0,1]<br />
u = rand;<br />
if u <= y % accept with probability f(y)/(c*g(y)) = y<br />
x(j) = y;<br />
j = j + 1;<br />
end<br />
end<br />
hist(x);<br />
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
<br />
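The discrete procedure above can be sketched in code. The following is a minimal illustration in Python (standard library only, rather than the Matlab used elsewhere in these notes); the function names <code>sample_discrete_ar</code> and <code>empirical_pmf</code> are made up for this sketch and do not come from the lecture.<br />

```python
import random

# Target pmf from the example above.
pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}
values = sorted(pmf)
g = 1.0 / len(values)        # discrete uniform proposal: g(x) = 0.25
c = max(pmf.values()) / g    # c = max f(x)/g(x) = 0.55/0.25 = 2.2


def sample_discrete_ar(rng):
    """Return one draw from pmf using the Acceptance/Rejection Method."""
    while True:
        y = rng.choice(values)           # step 1: Y ~ Unif{1,2,3,4}
        u = rng.random()                 # step 2: U ~ Unif[0,1)
        if u <= pmf[y] / (c * g):        # step 3: accept Y with prob f(Y)/(c g(Y))
            return y


def empirical_pmf(n, seed=0):
    """Relative frequencies of n accepted draws (hypothetical helper)."""
    rng = random.Random(seed)
    counts = {v: 0 for v in values}
    for _ in range(n):
        counts[sample_discrete_ar(rng)] += 1
    return {v: counts[v] / n for v in values}


freqs = empirical_pmf(20000)
```

For a large number of draws, the relative frequencies in <code>freqs</code> should be close to the target probabilities 0.15, 0.55, 0.20 and 0.10.<br />
<br />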
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a sample of useful size from the target distribution is obtained (on average, <math>c</math> candidates are needed per accepted point). It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math>, i.e. with mean <math> 1/\lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math>.<br />
<br />
Therefore, if we want to sample from the Gamma distribution, we can sample t independent exponential random variables with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Recall the additive property of the Gamma distribution:<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math>, then<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
Combining this property with the Inverse Transform Method, we can turn independent uniform samples into a sample from a Gamma distribution with integer shape parameter <math>t</math>.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far: for example, the inverse of F(Z) has no closed form. We may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is its distance from the origin (the pole), and θ is the angle between the polar axis and the line from the origin to the point, then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
Also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Since x and y are independent, their joint density is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint density of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}} \cdot \frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, an Exponential density in d and a Uniform density in <math>\theta</math>, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math>, where 1/2 is the rate of the exponential distribution (so d has mean 2).<br />
<br />
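The change of variables behind the joint density of <math>d = r^2</math> and <math>\theta</math> can be sketched as follows. This is the standard Jacobian computation, added here as a worked step rather than taken from the lecture:<br />
:<math>x = \sqrt{d}\cos\theta, \quad y = \sqrt{d}\sin\theta</math><br />
:<math>\left| \frac{\partial(x,y)}{\partial(d,\theta)} \right| = \left| \frac{\cos\theta}{2\sqrt{d}} \cdot \sqrt{d}\cos\theta + \sqrt{d}\sin\theta \cdot \frac{\sin\theta}{2\sqrt{d}} \right| = \frac{1}{2}</math><br />
:<math>f(d,\theta) = f(x,y) \left| \frac{\partial(x,y)}{\partial(d,\theta)} \right| = \frac{1}{2\pi}e^{-\frac{d}{2}} \cdot \frac{1}{2}, \quad d \geq 0, \ 0 \leq \theta < 2\pi</math><br />
<br />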
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independently} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: Sample from a bivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (called u in the code) and covariance matrix <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives the upper-triangular factor <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>, satisfying ss'*ss = sigma. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix (the symmetric square root), <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion of the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right] </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i = \bar{x}</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta = \mu</math> is Gaussian with mean <math>\mu_0</math> and standard deviation <math>\tau</math>, and treating <math>\sigma</math> as known:<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} \cdot \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math> and <math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean differs from the MLE: it is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
<br />
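The weighted-average form of the posterior mean can be checked numerically. The sketch below is in Python (standard library only) and assumes, as in the formula for <math>\tilde{\mu}</math>, a Gaussian prior on the mean with a known data standard deviation; the function name <code>posterior_mean_params</code> and the numbers used are illustrative only. The posterior standard deviation it returns is the standard conjugate-normal result.<br />

```python
def posterior_mean_params(xs, sigma, mu0, tau):
    """Posterior mean and s.d. of a Gaussian mean with known sigma.

    xs    : observed data
    sigma : known standard deviation of the data
    mu0   : prior mean
    tau   : prior standard deviation
    """
    n = len(xs)
    data_precision = n / sigma ** 2        # N / sigma^2
    prior_precision = 1.0 / tau ** 2       # 1 / tau^2
    total = data_precision + prior_precision
    xbar = sum(xs) / n if n > 0 else 0.0   # sample mean (unused when N = 0)
    mu_tilde = (data_precision * xbar + prior_precision * mu0) / total
    sigma_tilde = (1.0 / total) ** 0.5
    return mu_tilde, sigma_tilde


# Illustrative numbers: xbar = 2, sigma = 1, prior N(0, 1).
mu_tilde, sigma_tilde = posterior_mean_params([1.0, 2.0, 3.0], 1.0, 0.0, 1.0)
```

With these numbers the weights are 3/4 on the sample mean and 1/4 on the prior mean. With an empty data list the function returns the prior mean, matching the <math>N = 0</math> case noted above.<br />
<br />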
====Basic Monte Carlo Integration====<br />
<br />
Although it is hard to find a precise definition of "Monte Carlo Method", the method is widely used and is described in numerous articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to simulating a random process, that is, a process whose future evolution is described by probability distributions (the counterpart of a deterministic process); such simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim Unif(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
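The process above, together with the standard error and confidence interval, can be sketched as follows. This is a minimal illustration in Python (standard library only); the function name <code>mc_integrate</code> is made up for this sketch, and the value z = 1.96 assumes a 95% interval.<br />

```python
import math
import random


def mc_integrate(h, a, b, n, seed=0, z=1.96):
    """Basic Monte Carlo estimate of the integral of h over [a, b].

    Returns (I_hat, SE, (CI_low, CI_high)); z = 1.96 gives a 95% CI.
    """
    rng = random.Random(seed)
    # Y_i = w(x_i) = h(x_i) * (b - a) with x_i ~ Unif(a, b)
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]
    i_hat = sum(ys) / n
    s2 = sum((y - i_hat) ** 2 for y in ys) / (n - 1)   # sample variance
    se = math.sqrt(s2 / n)                             # SE = s / sqrt(N)
    return i_hat, se, (i_hat - z * se, i_hat + z * se)


# Integral of x^3 over [0, 1], whose exact value is 1/4.
i_hat, se, ci = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 100000)
```

The standard error shrinks like <math>1/\sqrt{N}</math>, so halving the width of the interval requires about four times as many samples.<br />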
<br />
'''Example 1'''<br />
<br />
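Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x \geq 0</math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, which can be done with the Inverse Transform Method: <math>x_i = -\log(u_i)</math> with <math>u_i \sim~ Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />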
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u=rand(100,1);<br />
x=-log(u); % x ~ Exp(1) by the Inverse Transform Method<br />
h=x.^0.5; % h(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x)<br />
quad(F,0,50)<br />
%The value obtained by numerical integration in Matlab (exact value: sqrt(pi)/2, about 0.8862)<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# <math> \displaystyle I = \left. \frac{x^4}{4} \right|_0^1 = \frac{1}{4} </math> (the exact value, for comparison)<br />
# <math>\displaystyle w(x) = x^3(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);<br />
mean(x.^3)<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite interval<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math>:<br />
# Substitute in <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>; the limits <math>x=5</math> and <math>x \to \infty</math> become <math>y=1</math> and <math>y \to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\left(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2}\right)</math><br />
<br />
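Example 3 can be checked against the exact value <math>I = 3e^{-5} \approx 0.0202</math>. The following Python sketch (standard library only; the names <code>w</code> and <code>estimate</code> are illustrative) applies the substitution <math>y = 1/(x-4)</math> and averages the transformed integrand over uniform draws.<br />

```python
import math
import random


def w(y):
    """Transformed integrand 3*exp(-(1/y + 4)) / y^2 on (0, 1]."""
    return 3.0 * math.exp(-(1.0 / y + 4.0)) / (y * y)


def estimate(n, seed=0):
    """Monte Carlo estimate of the integral of 3*exp(-x) over [5, infinity)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = 1.0 - rng.random()   # uniform on (0, 1]; avoids division by zero
        total += w(y)
    return total / n


exact = 3.0 * math.exp(-5.0)
i_hat = estimate(200000)
```

Because the transformed integrand is bounded on (0, 1], the estimate has a small variance and <code>i_hat</code> should agree with <code>exact</code> to several decimal places.<br />
<br />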
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be flat (uniform), and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find the conditional (posterior) distributions of <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior density of <math>\displaystyle p_1</math> given <math>X = x</math> is proportional to the Binomial likelihood <math>\displaystyle p_1^{x}(1-p_1)^{n-x}</math> viewed as a function of <math>\displaystyle p_1</math>, which is the kernel of a Beta distribution (and similarly for <math>\displaystyle p_2</math>). Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> for each sample in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for Y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta)<br />
<br />
In one run of this example, the mean obtained was 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\% \text{ CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=\min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
Assuming a flat prior on each <math>\displaystyle p_i</math>:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math><br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta</math> by finding the first (lowest) dose <math>\displaystyle x_i</math> whose sampled <math>\displaystyle p_i</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
<br />
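The Monte Carlo process for the dose-response example can be sketched as follows, in Python (standard library only). The lecture does not list the full data set, so the dose values and death counts below are hypothetical, chosen only to make the sketch runnable; the function names are also illustrative. Rejecting a sample as soon as the ordering is violated is equivalent to drawing all ten values and then discarding the sample.<br />

```python
import random


def sample_delta(doses, deaths, n_rats, rng):
    """Draw one ordered posterior sample p_1 <= ... <= p_k by rejection and
    return the first dose whose death probability is at least 0.5."""
    while True:
        ps = []
        ordered = True
        for y in deaths:
            p = rng.betavariate(y + 1, n_rats - y + 1)   # p_i ~ Beta(y_i+1, n-y_i+1)
            if ps and p < ps[-1]:
                ordered = False      # ordering violated: discard this sample
                break
            ps.append(p)
        if not ordered:
            continue
        for dose, p in zip(doses, ps):
            if p >= 0.5:
                return dose
        # No dose reaches 50%: delta is undefined for this sample, so we
        # draw again (an assumption of this sketch, not from the lecture).


def estimate_delta(doses, deaths, n_rats, n_samples, seed=0):
    """Average of n_samples accepted values of delta, plus the values."""
    rng = random.Random(seed)
    deltas = [sample_delta(doses, deaths, n_rats, rng) for _ in range(n_samples)]
    return sum(deltas) / n_samples, deltas


# Hypothetical data: ten dose levels, 10 rats per dose, deaths per dose.
doses = list(range(1, 11))
deaths = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
delta_bar, deltas = estimate_delta(doses, deaths, 10, 200)
```

When the observed counts are far from monotone, most samples are rejected at the ordering step, so this simple rejection scheme can become slow.<br />
<br />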
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution using samples drawn from a different distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the random variables. The main difference in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others, and the samples are weighted accordingly. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target distribution. Here we cast the integral <math>\int_{0}^1 g(x)dx</math> as the expectation with respect to U such that <math>\int_{0}^1 g(x)dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from (with g(x) > 0 wherever h(x)f(x) is nonzero), then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the Law of Large Numbers, <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because each value sampled from <math>\displaystyle g(x)</math> is given an 'importance' or 'weighting' according to how likely it is under the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method with samples drawn from <math>\displaystyle g</math>, taking each <math>\displaystyle h(x_i)</math> and weighting it by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that points which are more likely under <math>\displaystyle f</math> than under <math>\displaystyle g</math> count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of <math> h(x_i) </math> instead of a sum<br />
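<br />
The weighted-average estimator can be sketched as follows, in Python (standard library only). The target f (an Exp(1) density), the function h(x) = x and the proposal g (an exponential density with rate 0.5, which has a thicker tail than f) are choices made for this illustration, not taken from the lecture; with these choices the exact answer is I = 1.<br />

```python
import math
import random


def importance_sampling(h, f, g_pdf, g_sample, n, seed=0):
    """Estimate I = integral of h(x) f(x) dx using draws from g."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = g_sample(rng)                      # x_i ~ g
        total += (f(x) / g_pdf(x)) * h(x)      # B(x_i) * h(x_i)
    return total / n


def f(x):
    """Target density: Exp(1) on x >= 0."""
    return math.exp(-x)


def g_pdf(x):
    """Proposal density: exponential with rate 0.5 (mean 2)."""
    return 0.5 * math.exp(-0.5 * x)


def g_sample(rng):
    """Draw from the proposal; expovariate takes the rate as its argument."""
    return rng.expovariate(0.5)


# E_f[x] = 1 for the Exp(1) target.
i_hat = importance_sampling(lambda x: x, f, g_pdf, g_sample, 100000)
```

Here the weight <math>f(x)/g(x) = 2e^{-x/2}</math> is bounded, which keeps the variance of the estimator finite.<br />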
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Wherever <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1, so the corresponding points count for more. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) = \hat I </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x) </math> does not vanish correspondingly fast, then <math>\displaystyle E[(w(x))^2] </math> can be infinite. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, since then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> could be infinitely large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g </math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x))^2] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
======Aside: Jensen's Inequality======<br />
<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>; here <math>\displaystyle g </math> denotes a generic function, not the proposal density), then for <math>\displaystyle 0 \leq \alpha \leq 1 </math>, <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
Essentially the definition of convexity says that the line segment between two points on a curve lies on or above the curve. This generalizes to convex combinations of n points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0 </math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
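Using Jensen's Inequality, for a convex function <math>\displaystyle g </math> and a random variable <math>\displaystyle x </math> taking the values <math>\displaystyle x_1, ..., x_n </math> with probabilities <math>\displaystyle p_1, ..., p_n </math>,<br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Applying this with the convex function <math>\displaystyle t \mapsto t^2 </math> and the random variable <math>\displaystyle \left| w(x) \right| </math> gives<br />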
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E_g[(w(x))^2]</math>, we can see that this lower bound is attained, as required. Details omitted.<br /><br />
<br />
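One way to fill in this last step, added here as a worked sketch rather than taken from the lecture, is to substitute <math>\displaystyle g^* </math> into the weight. Wherever <math>\displaystyle h(x) \neq 0 </math>,<br />
::<math>\displaystyle w(x) = \frac{f(x)h(x)}{g^*(x)} = \frac{h(x)}{\left| h(x) \right|} \int \left| h(s) \right| f(s) ds </math><br />
so that <math>\displaystyle (w(x))^2 </math> is constant and<br />
::<math>\displaystyle E_{g^*}[(w(x))^2] = \left( \int \left| h(s) \right| f(s) ds \right)^2 </math><br />
which is exactly the lower bound on the second moment, so the minimum is attained at <math>\displaystyle g^* </math>.<br />
<br />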
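However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the kind of integral we are trying to estimate.<br />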
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and the <math>\displaystyle x_i</math> are drawn from <math>\displaystyle g</math><br />
Since <math>\displaystyle\int f(x)\,dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math> (the factors <math>\displaystyle 1/N</math> cancel)<br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{k=1}^{N} b(x_k)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{k=1}^{N} b(x_k)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
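<br />
A minimal Matlab sketch of the normalized estimator, for a target that is known only up to a constant. The unnormalized target, the <math>N(0,4)</math> proposal and <math>h(x)=x^2</math> are illustrative assumptions, not part of the lecture.<br />
<br />
```matlab
% Self-normalized importance sampling when f is known only up to a constant.
% Assumed for illustration: target proportional to exp(-x^2/2) (so f is N(0,1)),
% proposal g = N(0, 2^2), and h(x) = x^2, whose true expectation under f is 1.
N  = 100000;
fu = @(x) exp(-x.^2/2);                       % unnormalized target density
g  = @(x) exp(-x.^2/8) / (2*sqrt(2*pi));      % proposal density N(0,4)
h  = @(x) x.^2;

x  = 2*randn(N,1);            % draw x_i from g
b  = fu(x) ./ g(x);           % unnormalized weights b(x_i)
bn = b / sum(b);              % normalized weights b'(x_i), summing to 1
Ihat = sum(bn .* h(x))        % estimate of E_f[h(X)]
```
<br />
Because the weights are divided by their own sum, the unknown normalizing constant of the target cancels and never has to be computed.<br />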
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math>, where <math>f</math> is the standard normal density<br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_i \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to compute the fraction of observations that are greater than 3.<br />
<br />
The variability will be high in this case, since this is a low probability event (a tail event) and different runs may give significantly different values. For example, if we generate 1000 samples per run, the first run may give only 3 occurrences (an estimate of .003), the second run may give 5 occurrences (an estimate of .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (each of the 100 runs generates 100 samples):<br />
<br />
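 format long<br />
 a = zeros(100,1);                      % one estimate per run<br />
 for i = 1:100<br />
     a(i) = sum(randn(100,1)>=3)/100;   % fraction of 100 N(0,1) draws that are >= 3<br />
 end<br />
 meanMC = mean(a)                       % average of the 100 estimates<br />
 varMC = var(a)                         % variability of the estimate across runs<br />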
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
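:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is the standard normal density, the <math>x_i</math> are drawn from <math>g</math>, and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target, i.e. it puts most of its mass in the region <math>x \geq 3</math> that matters here.<br />
<br />
:Let <math>g(x)</math> be the <math>N(4,1) </math> density, so that<br />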
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
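 for j = 1:100                     % 100 independent runs<br />
     N = 100;<br />
     x = randn(N,1) + 4;           % sample from g = N(4,1)<br />
     w = zeros(N,1);<br />
     for ii = 1:N<br />
         h = x(ii)>=3;             % indicator h(x)<br />
         b = exp(8-4*x(ii));       % weight b(x) = f(x)/g(x)<br />
         w(ii) = h*b;<br />
     end<br />
     I(j) = sum(w)/N;              % importance sampling estimate for this run<br />
 end<br />
 MEAN = mean(I)<br />
 VAR = var(I)<br />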
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math>. The variance is very close to 0 and is much smaller than the variability observed when using Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...,X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Indexed Set:''' <math>\displaystyle T </math> is the set from which t takes its values. It could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Indexed Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
That is, the joint density factorizes as<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from state <math>i</math> to state <math>j</math> in one step.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way, by a coin toss. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0&0&0\\<br />
q&0&p&0&\dots&0&0&0\\<br />
0&q&0&p&\dots&0&0&0\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\<br />
0&0&0&0&\dots&q&0&p\\<br />
0&0&0&0&\dots&0&0&1<br />
\end{matrix}\right)</math><br />
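<br />
As a rough sketch, this matrix can be built in Matlab as follows, using the convention from the text that <math>p</math> is the probability of a step to the right. The values N = 6 and p = 0.3 are arbitrary illustrative choices.<br />
<br />
```matlab
% Transition matrix of a random walk on states 1..N with absorbing end states.
% N = 6 and p = 0.3 are arbitrary illustrative values.
N = 6;  p = 0.3;  q = 1 - p;
P = zeros(N);
P(1,1) = 1;                   % first state is absorbing
P(N,N) = 1;                   % last state is absorbing
for i = 2:N-1
    P(i,i+1) = p;             % step to the right
    P(i,i-1) = q;             % step to the left
end
rowSums = sum(P,2)            % every row is a pmf, so each entry should be 1
```
<br />
Checking the row sums is a cheap way to catch a mistyped transition matrix.<br />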
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a set of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a sample average can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in any particular row <math>i</math> sum to 1: the row is itself a pmf, since a transition from <math>i</math> must lead to a state still within the state space of <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
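<br />
A small numerical illustration of the two theorems, as a sketch only. The 3-state matrix and the starting vector below are made-up values, not taken from the lecture.<br />
<br />
```matlab
% Marginal distribution after n steps: mu_n = mu_0 * P^n.
% The 3-state matrix P and the start vector mu0 are made-up illustrative values.
P   = [0.5 0.5 0; 0.25 0.5 0.25; 0 0.5 0.5];
mu0 = [1 0 0];                % start in state 1 with probability 1
n   = 5;

mu = mu0;
for k = 1:n
    mu = mu * P;              % one step: mu_k = mu_{k-1} * P
end
muDirect = mu0 * P^n;         % same result from the n-step matrix P(n) = P^n
difference = max(abs(mu - muDirect))
```
<br />
Both routes give the same marginal up to rounding error, and the result remains a pmf.<br />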
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all of its states communicate with each other (i.e. it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only stay where they are or move to each other: the only nonzero entries in rows 1 and 2 are <math>\displaystyle P_{11}, P_{12}, P_{21}</math> and <math>\displaystyle P_{22}</math><br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state, but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
::<br />
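<br />
The decomposition can also be found mechanically. The following is a minimal sketch, assuming a finite chain, that marks which states are reachable from which and then groups states that reach each other. The variable names are illustrative.<br />
<br />
```matlab
% Communicating classes of the 4-state example via reachability.
P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];
N = size(P,1);
R = (eye(N) + P)^(N-1) > 0;   % R(i,j) is true if j is reachable from i in at most N-1 steps
C = R & R';                   % C(i,j) is true if i and j communicate
classId = zeros(1,N);
nextId  = 0;
for i = 1:N
    if classId(i) == 0        % state i starts a new class
        nextId = nextId + 1;
        classId(C(i,:)) = nextId;
    end
end
classId                       % states with equal labels are in the same class
```
<br />
For this matrix the labels come out as 1, 1, 2, 3, which matches the three classes listed above.<br />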
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i</math> for some <math>\displaystyle n\geq 1 | x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous property and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
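<br />
The theorem can be illustrated numerically on this example by accumulating <math>\sum_n P_{ii}(n)</math> up to a cutoff. This is only a sketch: a finite partial sum can suggest, but not prove, divergence, and the cutoff of 200 is an arbitrary choice.<br />
<br />
```matlab
% Partial sums of P_ii(n) for the 4-state example above.
P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];
nMax = 200;                   % truncation point, an arbitrary choice
S  = zeros(1,4);              % S(i) accumulates sum_{n=1}^{nMax} P_ii(n)
Pn = eye(4);
for n = 1:nMax
    Pn = Pn * P;              % Pn now holds the n-step matrix P^n
    S  = S + diag(Pn)';
end
S
```
<br />
The sum for state 3 settles near 1/3 because <math>P_{33}(n) = (1/4)^n</math>, while the sums for states 1, 2 and 4 keep growing as the cutoff increases.<br />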
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. If we have taken n steps to the right (or to the left), the only way to be back at <math>\displaystyle 0</math> is to have also taken n steps in the opposite direction, so a return can only happen after an even number of steps:<br />
<math>\displaystyle Pr(X_{2n}=0\mid X_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the sum becomes:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> (that is, <math>p \neq \frac{1}{2}</math>) the series converges and the state is transient; otherwise (<math>p = \frac{1}{2}</math>, so <math>\displaystyle 4pq = 1</math>) the series diverges and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} \; \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
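<br />
A quick numerical look at the series, as a sketch under assumed values. The asymmetric case p = 0.6 and the cutoff of 2000 terms are arbitrary illustrative choices.<br />
<br />
```matlab
% Partial sums of sum_n C(2n,n) (pq)^n for a symmetric and an asymmetric walk.
% p = 0.6 and the truncation nMax = 2000 are arbitrary illustrative choices.
nMax = 2000;
pVals = [0.5 0.6];
S = zeros(size(pVals));
for k = 1:numel(pVals)
    x = pVals(k) * (1 - pVals(k));      % x = pq
    term = 1;                           % n = 0 term: C(0,0) x^0 = 1
    S(k) = term;
    for n = 0:nMax-1
        % C(2n+2,n+1) = C(2n,n) * 2(2n+1)/(n+1), avoiding large factorials
        term = term * 2*(2*n+1)/(n+1) * x;
        S(k) = S(k) + term;
    end
end
S
closedForm = 1/sqrt(1 - 4*0.6*0.4)      % finite limit for p = 0.6
```
<br />
For p = 0.6 the partial sum agrees with the closed form <math>1/\sqrt{1-4pq}</math>, while for p = 0.5 it keeps growing as more terms are added.<br />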
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum number of steps needed to go from state i to state j. It is also possible that state j is not reachable from state i, in which case the recurrence time is infinite.<br />
<br />
Given <math>\displaystyle x_{0}=i</math>,<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n \geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} n f_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A state <math>\displaystyle i</math> is called <math>\emph{periodic}</math> with period <math>\displaystyle d > 1</math> if, starting from it, a return is possible only after a multiple of <math>\displaystyle d</math> steps; formally, <math>\displaystyle d = \gcd\{n \geq 1 : P_{ii}(n) > 0\}</math>. A Markov chain is called periodic if its states are periodic.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic. Starting from state 1, we can only get back to state 1 after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
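<br />
The period of a state can be estimated numerically as the greatest common divisor of the step counts at which a return has positive probability. The sketch below does this for state 1 of the two example chains. The search horizon of 20 steps is an assumption that is sufficient for these small chains but not in general.<br />
<br />
```matlab
% Period of state 1 as the gcd of all n <= nMax with P_11(n) > 0.
% nMax = 20 is an arbitrary search horizon, enough for these small chains.
P3 = [0 1 0; 0 0 1; 1 0 0];
P4 = [0 0.5 0 0.5; 0.5 0 0.5 0; 0 0.5 0 0.5; 0.5 0 0.5 0];
chains = {P3, P4};
nMax = 20;
d = zeros(1,2);
for c = 1:2
    P  = chains{c};
    Pn = eye(size(P,1));
    for n = 1:nMax
        Pn = Pn * P;                    % Pn = P^n
        if Pn(1,1) > 0
            d(c) = gcd(d(c), n);        % gcd(0,n) = n, so the first hit initializes d
        end
    end
end
d
```
<br />
The result is 3 for the three-state chain and 2 for the four-state chain, in agreement with the two examples.<br />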
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \;\forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math>, as required, since each row of <math>P</math> sums to 1.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
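<br />
The warning can be checked directly in Matlab. This is a small sketch. The starting vector is an illustrative choice.<br />
<br />
```matlab
% The cyclic chain has a stationary distribution but P^n does not converge.
P    = [0 1 0; 0 0 1; 1 0 0];
piSt = [1/3 1/3 1/3];
isStationary = max(abs(piSt*P - piSt)) < 1e-12   % true: pi = pi*P
mu0 = [1 0 0];                                   % start in state 1
mu3 = mu0 * P^3                                  % back at [1 0 0]
mu4 = mu0 * P^4                                  % [0 1 0], the marginals keep cycling
```
<br />
Starting from a single state, the marginal distribution rotates through the three states forever and never approaches (1/3, 1/3, 1/3).<br />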
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \;\; 0 \;\; 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
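<br />
The statement about time averages can be checked by simulation. The following is a sketch only: the two-state chain, the function g and the run length are illustrative assumptions, not from the lecture.<br />
<br />
```matlab
% Time average of g along a simulated chain versus sum_j g(j)*pi_j.
% The 2-state chain, g, and N are illustrative assumptions.
P    = [0.9 0.1; 0.5 0.5];
piSt = [5/6 1/6];                 % solves pi = pi*P for this P
g    = [1 10];                    % g(j) for states j = 1, 2
N    = 50000;
x    = 1;                         % initial state
total = 0;
for n = 1:N
    x = 1 + (rand > P(x,1));      % move to state 1 w.p. P(x,1), else to state 2
    total = total + g(x);
end
timeAvg  = total / N
spaceAvg = sum(g .* piSt)
```
<br />
The two averages should be close for a long run, even though consecutive states of the chain are dependent.<br />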
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
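<br />
The same answer can be obtained numerically. A minimal sketch: stack the equations <math>\pi = \pi P</math> with the normalization constraint and solve the resulting linear system, then compare with a high power of <math>P</math>. The power 100 is an arbitrary choice.<br />
<br />
```matlab
% Stationary distribution of the example chain by solving pi = pi*P with sum(pi) = 1.
P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];
A = [P' - eye(3); ones(1,3)];     % (P' - I)*pi' = 0 stacked with the normalization row
rhs = [0; 0; 0; 1];
piHat = (A \ rhs)'                % least-squares solve of the consistent 4x3 system
limitRows = P^100                 % each row should approach the same vector
```
<br />
Both <code>piHat</code> and every row of <code>limitRows</code> agree with (4/11, 4/11, 3/11) up to rounding.<br />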
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int^\ h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>. This is the starting point of the chain; choose it randomly and set the index <math>\displaystyle i=0</math><br />
:<br />2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=\min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br />4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. The starting point <math>X_0</math> of the algorithm is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=\min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm. It is the original algorithm of Metropolis et al. (1953) and a special case of the Metropolis-Hastings algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The probability of setting the mean to x and sampling y is equal to the probability of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> tends to land far out in the tails of <math>\displaystyle f</math>, so the fraction <math>\frac{f(y)}{f(x)}</math> is typically very small <math>\Rightarrow</math> <math>r=\min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
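<br />
One simple diagnostic for the choice of b is the fraction of proposals that are accepted. The sketch below reruns the Cauchy sampler above for three values of b and records that fraction. The chain length and the grid of b values are illustrative choices, and the acceptance rate alone does not prove good mixing.<br />
<br />
```matlab
% Acceptance rate of the Cauchy sampler above for several proposal widths b.
% The chain length N and the grid of b values are illustrative choices.
N = 5000;
bVals = [0.001 2 1000];
accRate = zeros(size(bVals));
for k = 1:numel(bVals)
    b = bVals(k);
    x = randn;                              % starting point
    nAcc = 0;
    for i = 2:N
        y = b*randn + x;                    % proposal from N(x, b^2)
        r = min((1 + x^2)/(1 + y^2), 1);    % acceptance probability for the Cauchy target
        if rand < r
            x = y;                          % accept
            nAcc = nAcc + 1;
        end
    end
    accRate(k) = nAcc/(N-1);
end
accRate
```
<br />
With a tiny b almost every proposal is accepted but each move is negligible; with a huge b almost every proposal is rejected. A moderate b sits between the two extremes.<br />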
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br /> We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
That is to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> by detailed balance<br />
:::<math>=f(x)\int P(x,y)dy</math><br />
:::<math>=\displaystyle f(x)</math> since <math>\int P(x,y)dy=1</math><br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where it equals 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math> when the chain is at <math>\displaystyle x</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math> when the chain is at <math>\displaystyle y</math><br />
:<math>\displaystyle r(x,y)</math> is the probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> is the probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
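<br />
The argument above can be checked numerically on a small discrete state space. The following Python sketch is only an illustration: the target <code>f</code>, the proposal matrix <code>q</code> (with <code>q[x][y]</code> playing the role of <math>q(y\mid{x})</math>) and the helper name <code>mh_transition_matrix</code> are assumptions made for this example and do not come from the lecture.<br />

```python
def mh_transition_matrix(f, q):
    """Build the Metropolis-Hastings transition matrix P for a discrete target f
    and a proposal matrix q, where q[x][y] is the probability of proposing y from x."""
    n = len(f)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y and q[x][y] > 0:
                # r(x, y) = min{ f(y) q(x|y) / (f(x) q(y|x)), 1 }
                r = min(1.0, (f[y] * q[y][x]) / (f[x] * q[x][y]))
                P[x][y] = q[x][y] * r
        # rejected proposals leave the chain at x
        P[x][x] = 1.0 - sum(P[x])
    return P

f = [0.1, 0.2, 0.3, 0.4]            # hypothetical target distribution
q = [[0.10, 0.40, 0.30, 0.20],      # hypothetical non-symmetric proposal
     [0.25, 0.25, 0.25, 0.25],
     [0.50, 0.10, 0.20, 0.20],
     [0.30, 0.30, 0.30, 0.10]]
P = mh_transition_matrix(f, q)

# detailed balance: f(x) P(x,y) = f(y) P(y,x) for every pair of states
balance_gap = max(abs(f[x] * P[x][y] - f[y] * P[y][x])
                  for x in range(4) for y in range(4))
# stationarity: sum_x f(x) P(x,y) should reproduce f(y)
fP = [sum(f[x] * P[x][y] for x in range(4)) for y in range(4)]
print(balance_gap)
print(fP)
```

Both checks should hold up to floating point rounding for any valid choice of <code>f</code> and <code>q</code>, since the construction mirrors the proof step by step.<br />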
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. It is a more efficient method, although less widely applicable.<br />
<br />
In the last class, we showed that Metropolis-Hastings satisfies the detailed balance equation, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough: we also want the chain to converge to the stationary distribution.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This holds trivially for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed).<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis-Hastings.<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; that is, <math>\epsilon = Y-X_i \sim~ g </math>. Here <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math>. <br />
:<br\>2. This means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of the distance between the current state and the state the chain is going to travel to, i.e. it is of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. Since <math>\displaystyle q </math> is symmetric, the acceptance probability reduces to <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In MATLAB, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e. <math>\displaystyle Y = X_i + randn*b </math>)<br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
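<br />
The procedure can be sketched in Python for the Cauchy target from the previous lecture. This is a minimal illustration rather than the code used in class: the function names, the seed, the chain length and the value b = 100 (used here as a "too large" proposal width) are all assumptions of this sketch.<br />

```python
import math
import random

def cauchy_pdf(x):
    """Density of the standard Cauchy distribution (the target f)."""
    return 1.0 / (math.pi * (1.0 + x * x))

def random_walk_mh(f, b, n, x0=0.0, seed=0):
    """Random walk Metropolis-Hastings with a N(x, b^2) proposal.
    Returns the chain and the fraction of proposals that were accepted."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    accepted = 0
    for _ in range(n - 1):
        y = x + b * rng.gauss(0.0, 1.0)   # Y = X_i + epsilon, epsilon ~ N(0, b^2)
        r = min(f(y) / f(x), 1.0)         # q is symmetric, so it cancels in r
        if rng.random() < r:
            x = y                         # accept Y
            accepted += 1
        chain.append(x)                   # on rejection the chain stays at x
    return chain, accepted / (n - 1)

# acceptance rate for a narrow, a moderate and a very wide proposal
rates = {b: random_walk_mh(cauchy_pdf, b, 5000)[1] for b in (0.001, 2.0, 100.0)}
for b, rate in rates.items():
    print(f"b = {b}: acceptance rate = {rate:.3f}")
```

A very small b accepts almost every proposal but barely moves, while a very large b rejects most proposals; the rule of thumb above aims for something in between.<br />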
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math> (it is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known, but the distribution is not exactly known).<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>. So it is not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability r} \\<br />
x_i, & \text{with probability 1-r} \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because the acceptance probability <math>\displaystyle r </math> depends on the current state <math>\displaystyle X_i </math>.<br />
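<br />
The dependence can be seen in a small Python sketch. Everything specific here is an assumption chosen for illustration: the target is a standard normal known only up to a constant, the fixed proposal is <math>\displaystyle N(0, 2^2)</math>, and the function names are hypothetical. Logarithms are used only to avoid numerical underflow.<br />

```python
import math
import random

def log_f(x):
    """Log of the unnormalized target density (standard normal up to a constant)."""
    return -0.5 * x * x

def log_q(x):
    """Log of the unnormalized proposal density N(0, 2^2)."""
    return -0.5 * (x / 2.0) ** 2

def independence_mh(n, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n):
        y = rng.gauss(0.0, 2.0)            # the proposal ignores the current state
        # log of f(y)/f(x) * q(x)/q(y)
        log_ratio = (log_f(y) - log_f(x)) + (log_q(x) - log_q(y))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
        chain.append(x)
    return chain

chain = independence_mh(20000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
# consecutive equal values come from rejections, so the draws are not independent
repeats = sum(1 for a, b in zip(chain, chain[1:]) if a == b)
print(mean, var, repeats / (len(chain) - 1))
```

The sample mean and variance should be close to those of the target, while the positive fraction of repeated values shows that successive states are dependent.<br />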
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. This is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant T > 0 (since the exponential function is monotonically increasing, <math>\displaystyle e^{\frac{-h(x)}{T}}</math> is largest exactly where <math>\displaystyle h(x) </math> is smallest).<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, set <math>\displaystyle i = i + 1</math>, and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, which is close to 1 when T is large, so we still accept <math>\displaystyle y </math> with high probability and the chain can escape local minima<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore almost surely reject <math>\displaystyle y </math><br />
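<br />
The effect of the temperature on the acceptance probability can be tabulated with a few lines of Python. This is only a sketch for a symmetric proposal; the function name and the example values of <math>\displaystyle h(x)</math>, <math>\displaystyle h(y)</math> and T are assumptions.<br />

```python
import math

def acceptance_probability(h_x, h_y, T):
    """r = min{ exp((h(x) - h(y)) / T), 1 } for a symmetric proposal."""
    if h_y <= h_x:
        return 1.0                      # downhill (or equal) moves are always accepted
    return math.exp((h_x - h_y) / T)    # uphill moves are accepted with probability < 1

# an uphill move from h(x) = 1 to h(y) = 2 at decreasing temperatures
for T in (100.0, 1.0, 0.01):
    print(T, acceptance_probability(1.0, 2.0, T))
```

As T decreases, the same uphill move goes from being accepted almost always to being rejected almost always, which is the behaviour described in the two cases above.<br />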
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle \frac{f(x)}{T}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
In the end, when T is small, the distribution that we are sampling from becomes sharper, and the points that we sample are close to the local maximum of the exponential function (which is the mode of the distribution), thereby corresponding to the local minimum of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
% obj.m : the objective function (saved in its own file)<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
% main script<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2; %standard deviation of the normal proposal<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); %propose Y ~ N(X(i-1), b^2)<br />
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
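<br />
The whole cooling schedule can also be run in one loop. The Python sketch below is an illustrative rewrite, not the MATLAB code used in lecture: the function name, the seed and the choice of 100 steps per temperature are assumptions, while the objective, the proposal width b = 2 and the temperature values follow the example above.<br />

```python
import math
import random

def h(x):
    """Objective function to minimize (Example 2)."""
    return (x - 2.0) ** 2

def simulated_annealing(h, temperatures, steps_per_temperature=100,
                        b=2.0, x0=0.0, seed=0):
    rng = random.Random(seed)
    x = x0
    path = [x]
    for T in temperatures:                       # T is decreased stage by stage
        for _ in range(steps_per_temperature):
            y = x + b * rng.gauss(0.0, 1.0)      # symmetric proposal N(x, b^2)
            delta = h(x) - h(y)
            # r = min{ exp((h(x) - h(y)) / T), 1 }
            r = 1.0 if delta >= 0 else math.exp(delta / T)
            if rng.random() < r:
                x = y
            path.append(x)
    return path

schedule = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
path = simulated_annealing(h, schedule)
print(path[-1])   # final state, expected to lie close to the minimizer x = 2
```

Writing the ratio as <code>exp((h(x) - h(y)) / T)</code> instead of a quotient of two exponentials avoids dividing by a number that has underflowed to zero when T is very small.<br />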
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the one given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method mixes poorly, and can become impractical, when the random variables are strongly correlated.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing each coordinate in turn, conditioning on the most recent values of all the other coordinates:<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})</math><br />
<br />
Once the chain has converged, the resulting points <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> approximately follow the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in" (or "burning") time. For this reason the distribution will be sampled better using only some of the last <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; %mean vector of the target distribution<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2); %conditional standard deviation<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); %conditional mean of x1 given the previous x2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); %conditional mean of x2 given the NEW x1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -> <br />
%this demonstrates that convergence occurs after a while <br />
%(this is called the burn-in, or "burning", time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
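<br />
For comparison, the same sampler can be sketched in Python and checked against the known moments of the target. The function name, the seed, the burn-in length of 1000 and the value <math>\displaystyle \rho = 0.5</math> (chosen so that the correlation is clearly visible) are assumptions of this sketch, not values from the lecture.<br />

```python
import math
import random

def gibbs_bivariate_normal(mu1, mu2, rho, n, seed=0):
    """Gibbs sampler for a bivariate normal with unit variances and correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)        # conditional standard deviation
    x1, x2 = 0.0, 0.0                     # initial point (0, 0)
    samples = []
    for _ in range(n):
        x1 = rng.gauss(mu1 + rho * (x2 - mu2), sd)   # draw x1 given the current x2
        x2 = rng.gauss(mu2 + rho * (x1 - mu1), sd)   # draw x2 given the NEW x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(1.0, 2.0, 0.5, 20000)
kept = samples[1000:]                     # discard the burn-in period
m = len(kept)
mean1 = sum(s[0] for s in kept) / m
mean2 = sum(s[1] for s in kept) / m
var1 = sum((s[0] - mean1) ** 2 for s in kept) / m
var2 = sum((s[1] - mean2) ** 2 for s in kept) / m
cov = sum((s[0] - mean1) * (s[1] - mean2) for s in kept) / m
corr = cov / math.sqrt(var1 * var2)
print(mean1, mean2, corr)
```

The sample means and the sample correlation of the retained points should be close to <math>\displaystyle \mu</math> and <math>\displaystyle \rho</math> respectively.<br />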
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
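<br />
The procedure above can be sketched in Python as follows. The target (an unnormalized bivariate normal with correlation 0.5), the random walk proposals with width b = 1, the seed and all function names are assumptions made for illustration. Because the proposals are symmetric, the ratio of proposal densities in r equals 1 and is omitted.<br />

```python
import math
import random

def log_f(x, y, rho=0.5):
    """Log of an unnormalized bivariate normal density (zero means, unit variances)."""
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))

def mh_step(log_target, current, b, rng):
    """One random walk Metropolis-Hastings step for a single coordinate."""
    z = current + b * rng.gauss(0.0, 1.0)                    # symmetric proposal
    log_r = min(0.0, log_target(z) - log_target(current))    # log of min{ratio, 1}
    return z if rng.random() < math.exp(log_r) else current

def mh_within_gibbs(n, b=1.0, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    samples = []
    for _ in range(n):
        x = mh_step(lambda z: log_f(z, y), x, b, rng)   # update X, treating Y as fixed
        y = mh_step(lambda z: log_f(x, z), y, b, rng)   # update Y, treating the new X as fixed
        samples.append((x, y))
    return samples

kept = mh_within_gibbs(40000)[2000:]      # drop an initial burn-in period
m = len(kept)
mean_x = sum(s[0] for s in kept) / m
mean_y = sum(s[1] for s in kept) / m
var_x = sum((s[0] - mean_x) ** 2 for s in kept) / m
var_y = sum((s[1] - mean_y) ** 2 for s in kept) / m
cov = sum((s[0] - mean_x) * (s[1] - mean_y) for s in kept) / m
corr = cov / math.sqrt(var_x * var_y)
print(mean_x, mean_y, corr)
```

Neither conditional distribution is sampled directly here; each coordinate is instead moved by one Metropolis-Hastings step against the joint density with the other coordinate held fixed.<br />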
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N</math> X <math>\displaystyle 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we can fix its scale. Summing the rank formula over all pages (and assuming every page has at least one outgoing link, so that <math>\displaystyle e^T \times L \times D^{-1} = e^T</math>) shows that the ranks satisfy <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). We can use this to solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n we have <math>\displaystyle P_n \approx P</math>.<br />
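<br />
The iteration can be sketched in Python on a tiny example. The four-page link structure, the value d = 0.85, the tolerance and the function name are assumptions chosen for illustration. The sketch applies the rank formula from the Definitions componentwise, which is the same update as multiplying by A when the ranks are scaled to sum to N.<br />

```python
def pagerank(L, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate P_i = (1 - d) + d * sum_j L[i][j] * P_j / c_j until it stops changing.
    L[i][j] = 1 if page j links to page i; every page needs at least one outgoing link."""
    n = len(L)
    c = [sum(L[i][j] for i in range(n)) for j in range(n)]   # outgoing links of page j
    P = [1.0] * n                                            # initial guess
    for _ in range(max_iter):
        new_P = [(1.0 - d) + d * sum(L[i][j] * P[j] / c[j] for j in range(n))
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_P, P)) < tol:
            return new_P
        P = new_P
    return P

# hypothetical web of 4 pages:
# page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0, page 3 links to 0 and 2
L = [[0, 0, 1, 1],
     [1, 0, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 0, 0]]
P = pagerank(L)
print(P)
print(sum(P))   # the ranks sum to N = 4 up to rounding
```

Page 3 has no incoming links, so its rank is the constant term <math>\displaystyle 1-d</math>, which illustrates why <math>\displaystyle P_i</math> is never zero.<br />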
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
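<br />
These definitions translate directly into code. The short Python sketch below uses plain lists as vectors; the function names are chosen for this illustration only.<br />

```python
import math

def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Length of v: the square root of v . v."""
    return math.sqrt(dot(v, v))

def dist(u, v):
    """Distance between u and v: the length of u - v."""
    return norm([a - b for a, b in zip(u, v)])

def angle(u, v):
    """Angle theta between non-zero vectors, from u . v = ||u|| ||v|| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))   # clamp against rounding error

print(dot([1, 2, 3], [4, 5, 6]))   # 32
print(norm([3, 4]))                # 5.0
print(dist([1, 1], [4, 5]))        # 5.0
print(angle([1, 0], [0, 1]))       # pi / 2, since the vectors are orthogonal
```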
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace also equals the sum of the eigenvalues of the matrix (counted with multiplicity):<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math> and satisfies <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given a matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
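<br />
As a small worked illustration (the matrix is chosen here only as an example and is not from the lecture), take <math>A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}</math>. Its characteristic equation is<br />
<math>|A-\lambda I| = (2-\lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda-1)(\lambda-3) = 0</math><br />
so the eigenvalues are <math>\lambda_1 = 3</math> and <math>\lambda_2 = 1</math>. Substituting each value back into <math>(A-\lambda I)v = 0</math> gives <math>v_1 \propto [1, 1]^T</math> and <math>v_2 \propto [1, -1]^T</math>. Note that <math>\lambda_1 + \lambda_2 = 4 = tr(A)</math> and that <math>v_1^Tv_2 = 0</math>.<br />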
<br />
Properties<br />
*If A is non-singular, all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite, all eigenvalues are positive.<br />
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix.<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^TA^TAx} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
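<br />
The properties listed above can be checked on a 2-dimensional rotation, which is a standard example of an orthonormal transformation. This is only a sketch: the function names, the angle, and the test vector are assumptions made for illustration.<br />

```python
import math


def rotation(theta):
    """2x2 rotation matrix; its rows (and columns) are orthonormal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s],
            [s, c]]


def apply(a, x):
    """Compute y = A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def norm(x):
    """Euclidean length sqrt(x^T x)."""
    return math.sqrt(sum(xi * xi for xi in x))


A = rotation(math.pi / 6)
x = [3.0, 4.0]
y = apply(A, x)

# The transformed vector has the same length as the original one.
print(norm(x), norm(y))

# Inner products between the rows of A: 1 for a row with itself, 0 otherwise.
row_products = [[sum(p * q for p, q in zip(r1, r2)) for r2 in A] for r1 in A]
print(row_products)
```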
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keeping two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
which is the sample variance of the projection <math>\displaystyle u </math> formed with the weight vector <math>\displaystyle w </math>.<br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint. The problem then becomes,<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to understand which one is the maximum, we just need to substitute each of them into <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
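<br />
The result of this example can be sanity-checked without any calculus by walking around the constraint circle and keeping the point with the largest value of <math>\displaystyle f</math>. The Python sketch below is only a brute-force check under the assumption that a grid of 100000 angles is fine enough; it is not the Lagrange multiplier method itself.<br />

```python
import math


def f(x, y):
    """Objective function of the example."""
    return x - y


steps = 100000  # number of grid points on the unit circle (an arbitrary choice)

# Every point (cos t, sin t) satisfies the constraint x^2 + y^2 = 1.
best_value, best_point = max(
    (f(math.cos(t), math.sin(t)), (math.cos(t), math.sin(t)))
    for t in (2.0 * math.pi * k / steps for k in range(steps))
)

print(best_value)  # close to sqrt(2)
print(best_point)  # close to (sqrt(2)/2, -sqrt(2)/2)
```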
<br />
====Determining '''W''' ====<br />
Back to the original problem, we form the Lagrangian,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (<math> \textbf{w}^T \textbf{w} = 1</math>), then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is non-zero and acts as a penalty on <math>\displaystyle L(\textbf{w},\lambda)</math>. At the solution the constraint <math> \textbf{w}^T \textbf{w} =1 </math> is satisfied.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
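<br />
The conclusion above (the first principal component is the eigenvector of '''S''' with the largest eigenvalue) can be illustrated with a short Python sketch. It uses power iteration, which is a technique swapped in here for illustration and not the method used in the lecture. The toy data set and all function names are assumptions, and the sketch assumes the starting vector is not orthogonal to the leading eigenvector.<br />

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix of a list of observations (one row per point)."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - mean[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(a, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def first_principal_component(s, iterations=200):
    """Approximate the unit eigenvector of s with the largest eigenvalue."""
    w = [1.0] + [0.0] * (len(s) - 1)
    for _ in range(iterations):
        v = matvec(s, w)
        length = math.sqrt(sum(vi * vi for vi in v))
        w = [vi / length for vi in v]
    lam = sum(wi * vi for wi, vi in zip(w, matvec(s, w)))  # w^T S w
    return w, lam


# Made-up 2-dimensional data lying roughly along the line y = x.
data = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 4.8]]
S = covariance_matrix(data)
w_star, lam_star = first_principal_component(S)
print(w_star, lam_star)
```

Here <code>lam_star</code> is the variance of the data projected onto <code>w_star</code>, so it should be at least as large as the variance along either coordinate axis.<br />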
<br />
<br />
''D''-dimensional data will have ''D'' eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat') % loads the data matrix X<br />
 who     % list the variables that were loaded<br />
 size(X) % each column of X is reshaped below into one 20 by 28 image<br />
 imagesc(reshape(X(:,1),20,28)') % display the first noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the denoised image.<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. the data are labeled, and we learn to assign new data to one of the known classes) or cluster the data (i.e. the data are not labeled, and we group them by similarity).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 by 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
The Matlab code is as follows.<br />
 load 2_3 % the size of this file is 64 X 400<br />
 [coefs, scores] = princomp(X'); <br />
 % performs principal components analysis on the data matrix X<br />
 % returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
 plot(scores(:,1),scores(:,2)) <br />
 % plots the first most variant dimension on the x-axis <br />
 % and the second highest on the y-axis <br />
 plot(scores(1:200,1),scores(1:200,2))<br />
 % same graph as above, only with the 2s (not 3s)<br />
 hold on % this command allows us to add to the current plot<br />
 plot(scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
 % the 'ro' option makes them red Os on the plot<br />
 % If we classify based on the position in this plot (feature), <br />
 % it is easier than looking at each of the 64 data pieces<br />
 gname() % displays a figure window and <br />
 % waits for you to press a mouse button or a keyboard key<br />
 figure<br />
 subplot(1,2,1)<br />
 imagesc(reshape(X(:,45),8,8)')<br />
 subplot(1,2,2)<br />
 imagesc(reshape(X(:,93),8,8)')<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
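<br />
The encoding and reconstruction steps in the notes above can be sketched in a few lines of Python. This is only an illustration: the data matrix (already centred, one column per point) and the single direction in <code>U</code> are made-up assumptions, chosen so that <code>U</code> is the leading eigenvector of the sample covariance of <code>X</code>, with D = 2 and d = 1.<br />

```python
import math


def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Hypothetical centred data: D = 2 features (rows), n = 4 points (columns).
X = [[2.0, -2.0, 1.0, -1.0],
     [2.0, -2.0, -1.0, 1.0]]

# Assumed top principal direction, stored as one orthonormal column (d = 1).
U = [[1.0 / math.sqrt(2.0)],
     [1.0 / math.sqrt(2.0)]]

Y = matmul(transpose(U), X)  # encoding:       d x n
X_hat = matmul(U, Y)         # reconstruction: D x n

print(Y)
print(X_hat)
```

The first two points lie on the retained direction and are reconstructed exactly; the last two lie on the discarded direction and are reconstructed as the (zero) mean.<br />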
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower dimensional space. The difference is that we are not interested in maximizing variances. Rather our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br> (where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2 </math> are the covariance matrices for the 1st and 2nd class of data respectively) <br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> : within classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br />
Let <math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> : between classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
(1) <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
(2) <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br />
<math>\displaystyle [(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) ] / [(w^T(\Sigma_1 + \Sigma_2)w)] </math><br />
<br />
or<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)/(w^Ts_ww)</math><br />
<br />
<br />
<br />
This is a very famous problem - "the generalized eigenvalue problem". We can solve this using Lagrange Multipliers. Since w is a direction vector, we do not care about its length. Therefore we can fix the value of the denominator and solve the following constrained optimization problem<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
<br />Subject To: <br />
<math>\displaystyle (w^T(\Sigma_1 + \Sigma_2)w)=1 </math> or <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
<br />
<br />
Therefore, the function that we want to maximize is<br />
<br />
<math>\displaystyle (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) - \lambda * [(w^T(\Sigma_1 + \Sigma_2)w)-1] </math><br />
<br />
or <br />
<br />
<math>\displaystyle (w^Ts_Bw) - \lambda * [(w^Ts_ww)-1] </math><br />
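<br />
Setting the derivative of this Lagrangian with respect to w to zero gives <math>\displaystyle s_Bw = \lambda s_ww</math>. For two classes, a standard result (stated here as background, since the notes stop before deriving it) is that the solution is proportional to <math>\displaystyle s_w^{-1}(\mu_1 - \mu_2)</math>. The Python sketch below checks this on made-up 2-dimensional data; the data, the helper names, and the use of sample covariances are all assumptions made for illustration.<br />

```python
def mean(points):
    """Component-wise mean of a list of 2-dimensional points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(2)]


def covariance(points):
    """2x2 sample covariance matrix of a list of 2-dimensional points."""
    mu, n = mean(points), len(points)
    return [[sum((p[i] - mu[i]) * (p[j] - mu[j]) for p in points) / (n - 1)
             for j in range(2)] for i in range(2)]


def quad(w, m):
    """Quadratic form w^T M w."""
    return sum(w[i] * m[i][j] * w[j] for i in range(2) for j in range(2))


def fisher_ratio(w, s_b, s_w):
    """The objective (w^T s_B w) / (w^T s_w w)."""
    return quad(w, s_b) / quad(w, s_w)


# Made-up data for two classes.
class1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 5.0], [5.0, 5.0]]
class2 = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.0]]

mu1, mu2 = mean(class1), mean(class2)
diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
c1, c2 = covariance(class1), covariance(class2)
s_w = [[c1[i][j] + c2[i][j] for j in range(2)] for i in range(2)]
s_b = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

# w = s_w^{-1} (mu1 - mu2), using the closed-form inverse of a 2x2 matrix.
det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
w = [(s_w[1][1] * diff[0] - s_w[0][1] * diff[1]) / det,
     (-s_w[1][0] * diff[0] + s_w[0][0] * diff[1]) / det]

print(w, fisher_ratio(w, s_b, s_w))
```

Any other direction, for example the line joining the two means, should give a ratio no larger than the one printed for <code>w</code>.<br />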
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
Optimization problem:<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
<br />Subject to: <br />
<math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
Partial solution (the Lagrangian): <math>\displaystyle (w^Ts_Bw) - \lambda * [(w^Ts_ww)-1] </math><br />
<br />
<br />
<br />
<br />
<br />
Example: FDA<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance we could be wishing to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image.<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> are called the ''training set.''<br />
<br />
Then <math>h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we find the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points. This gives the empirical error rate:<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
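<br />
The counting procedure can be written directly in code. The Python sketch below is a minimal illustration: the threshold classifier and the five labeled test points are made up, and any function that maps an input to a label could be passed in place of <code>threshold_classifier</code>.<br />

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) differs from y_i."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


def threshold_classifier(x):
    """Hypothetical classifier: predict 1 when x is at least 0.5, else 0."""
    return 1 if x >= 0.5 else 0


# Made-up test points and their true labels.
test_x = [0.1, 0.4, 0.6, 0.9, 0.45]
test_y = [0, 0, 1, 1, 1]

# Only the last point is misclassified, so the error rate is 1/5.
print(empirical_error_rate(threshold_classifier, test_x, test_y))
```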
<br />
=== Bayes Classification Rule ===<br />
<br />
Considering the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math><br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)Pr(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)Pr(Y=1)+Pr(X=x \mid Y=0)Pr(Y=0) </math><br />
<br />
So our classifier function<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
This function is considered the best classifier in terms of error rate. <br />
<br />
A problem is that, in practice, we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>.<br />
<br />
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques.<br />
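<br />
When the class-conditional distributions and the prior are assumed to be known, the Bayes rule can be evaluated directly. The Python sketch below is such a toy case and not something from the lecture: it assumes X given Y=0 is N(0,1) and X given Y=1 is N(2,1), and it uses densities in place of <math>\, Pr(X=x \mid Y)</math> because X is continuous. All names are hypothetical.<br />

```python
import math


def gaussian_density(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_class1(x, prior1=0.5):
    """r(x) = Pr(Y=1 | X=x) by Bayes' theorem under the assumed model."""
    like1 = gaussian_density(x, 2.0, 1.0)  # assumed model for class 1
    like0 = gaussian_density(x, 0.0, 1.0)  # assumed model for class 0
    return like1 * prior1 / (like1 * prior1 + like0 * (1.0 - prior1))


def bayes_classifier(x, prior1=0.5):
    """h(x) = 1 when r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior_class1(x, prior1) >= 0.5 else 0


for x in (0.0, 0.5, 1.5, 2.0):
    print(x, posterior_class1(x), bayes_classifier(x))
```

With equal priors and equal variances, the rule switches from class 0 to class 1 halfway between the two assumed means.<br />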
<br />
One such technique is based on the decision boundary.<br />
<br />
=== Decision Boundary ===</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3348stat341 / CM 3612009-07-21T23:03:43Z<p>Hclam: /* Objective Function */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. Because a computer is deterministic, random numbers are in practice produced by deterministic algorithms whose output only appears to be random (pseudo-random); the simplest goal is for each possible number to appear with uniform probability. Outside a computational setting, presenting a uniform distribution is fairly easy (for example, rolling a fair die repetitively to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
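<br />
The recurrence is short enough to try out directly. The Python sketch below reproduces the first terms of the example with ''a'' = 13, ''b'' = 0, ''m'' = 31 and then scales one full cycle by ''m'' - 1; the function name <code>lcg</code> is an assumption made for this illustration.<br />

```python
def lcg(a, b, m, seed, count):
    """Return x_1, ..., x_count of the recurrence x_{i+1} = (a*x_i + b) mod m."""
    values = []
    x = seed
    for _ in range(count):
        x = (a * x + b) % m
        values.append(x)
    return values


first_three = lcg(13, 0, 31, 1, 3)
print(first_three)  # [13, 14, 27], matching the hand computation

# One full cycle of the example generator, scaled by m - 1.
cycle = lcg(13, 0, 31, 1, 30)
scaled = [x / (31 - 1) for x in cycle]
print(scaled[:3])
```

Because ''b'' = 0 in this example, the value 0 is never produced, so one cycle visits each of 1, ..., 30 exactly once.<br />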
<br />
Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and those that do not are not suitable for use in generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
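The two examples above can be reproduced with a short Python sketch (Python is used here rather than Matlab; the function names <code>lcg</code> and <code>period</code> are illustrative and not from the lecture). It generates the sequence and measures the length of the cycle the sequence eventually falls into.<br />

```python
def lcg(a, b, m, seed):
    """Yield x_1, x_2, ... from x_{i+1} = (a*x_i + b) mod m."""
    x = seed
    while True:
        x = (a * x + b) % m
        yield x


def period(a, b, m, seed):
    """Length of the cycle that the sequence eventually enters."""
    seen = {}
    x = seed
    step = 0
    while x not in seen:
        seen[x] = step
        x = (a * x + b) % m
        step += 1
    return step - seen[x]


gen = lcg(13, 0, 31, 1)
first_three = [next(gen) for _ in range(3)]   # 13, 14, 27, matching the first example
good_period = period(13, 0, 31, 1)            # cycles through all of 1..30
bad_period = period(3, 2, 4, 1)               # stuck at 1 from the first step
```

With ''a'' = 13 and ''m'' = 31 the cycle has length 30, so every nonzero residue is visited before the sequence repeats, whereas ''a'' = 3, ''b'' = 2, ''m'' = 4 has a cycle of length 1.<br />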
<br />
Ideally, ''m'' is a very large prime number, <math>x_{n}\not= 0</math> for any ''n'', and the period is equal to ''m'' - 1. In practice, Park and Miller showed in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the largest value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, strictly increasing, and is the cumulative distribution function (cdf) for some distribution, then the distribution of the random variable X is given by F(x).<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
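:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
Since <math>1-U</math> is also distributed as <math>Unif[0, 1]</math> when <math>U</math> is, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>.<br />
<br />
Now we can generate our random sample <math>x_1, \dots, x_n</math> from <math>f(x)</math> by repeating, for <math>i=1\dots n</math>:<br />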
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows an exponential pattern, as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions the inverse of <math>F(x)</math> is too difficult (or impossible) to find in closed form. Further, for some distributions <math>F(x)</math> itself has no closed form expression.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (less-than-or-equal-to signs are used on the upper bounds so that the case of <math>U = 1</math> is not missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
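The general step can be sketched in Python as a search over the cumulative sums of the <math>p_i</math>. This is only an illustration: the function name <code>sample_discrete</code> is made up here, and the values and probabilities in the usage lines are hypothetical.<br />

```python
import bisect
import itertools
import random


def sample_discrete(values, probs, rng):
    """Return values[k], where k is the first index with U <= p_0 + ... + p_k."""
    cumulative = list(itertools.accumulate(probs))
    u = rng.random()                               # U ~ Unif[0, 1)
    k = bisect.bisect_left(cumulative, u)          # first k with cumulative[k] >= u
    return values[min(k, len(values) - 1)]         # guard against rounding in the last sum


rng = random.Random(0)
values = ["a", "b", "c"]                           # hypothetical outcomes
probs = [0.1, 0.6, 0.3]                            # hypothetical probabilities
draws = [sample_discrete(values, probs, rng) for _ in range(10000)]
freqs = [draws.count(v) / len(draws) for v in values]
```

The observed frequencies in <code>freqs</code> should be close to <code>probs</code>. Recomputing the cumulative sums on every call keeps the sketch short; for repeated sampling they would normally be computed once.<br />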
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
x=zeros(1,1000); % preallocate the sample<br />
for i=1:1000<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
elseif u <= sum(p(1:2)) % cumulative probability p(1)+p(2) = 0.5<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c \ |\ c \cdot g(x) \geq f(x)\ \forall x</math>, we can draw a candidate <math>Y</math> from <math>g</math> and accept it with probability <math> \frac {f(Y)}{c \cdot g(Y)} </math>; the accepted values follow the target distribution <math>f(x)</math>. Candidates at points where this ratio is close to 1 are almost always accepted, while candidates at points where the ratio is small are usually rejected.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (proposal distribution scaled by <math>c</math>)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is close to 1, so a candidate drawn near this point is almost always accepted.<br />
<br />
At x=9, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is smaller, so a candidate drawn near this point is rejected with probability one minus the ratio.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x) </math><br />
<br />
So,<br />
:<math>\begin{align}<br />
Pr(accept) &{}= \int Pr(accept|X) \times Pr(X)\, dx \\<br />
&{}= \int \frac{f(x)}{c \cdot g(x)} g(x)\, dx \\<br />
&{}= \frac{1}{c} \int f(x)\, dx \\<br />
&{}= \frac{1}{c}<br />
\end{align}</math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
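The three numbered steps can be sketched as a small Python function. This is a sketch under stated assumptions, not the lecture's code: the name <code>accept_reject</code> and its arguments are illustrative, and the target <math>f(x) = 3x^2</math> on <math>[0,1]</math> with a uniform proposal and <math>c = 3</math> is chosen here only as a test case.<br />

```python
import random


def accept_reject(f, g_pdf, g_sample, c, rng):
    """Return (x, tries): one draw from density f and the number of proposals used."""
    tries = 0
    while True:
        tries += 1
        y = g_sample(rng)                     # step 1: proposal Y ~ g
        u = rng.random()                      # step 2: U ~ Unif[0, 1]
        if u <= f(y) / (c * g_pdf(y)):        # step 3: accept with probability f(Y) / (c g(Y))
            return y, tries


def target(x):
    return 3.0 * x * x                        # illustrative density 3x^2 on [0, 1]


rng = random.Random(1)
results = [accept_reject(target, lambda x: 1.0, lambda r: r.random(), 3.0, rng)
           for _ in range(20000)]
xs = [x for x, _ in results]
sample_mean = sum(xs) / len(xs)                          # E[X] = 3/4 for this target
acceptance_rate = len(xs) / sum(t for _, t in results)   # roughly 1/c = 1/3
```

The observed acceptance rate is close to <math>\frac{1}{c}</math>, in line with the value of <math>Pr(accept)</math> obtained in the proof.<br />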
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. Any <math>c</math> at least this large satisfies <math>c \cdot g(x) \geq f(x)</math>, and since the probability of accepting a candidate is <math>\frac{1}{c}</math>, the smallest valid <math>c</math> wastes the fewest candidates.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
j = 1; % index of the next accepted sample<br />
while j <= 1000<br />
y = rand; % proposal Y ~ Unif[0,1]<br />
u = rand;<br />
if u <= y % accept with probability f(y)/(c*g(y)) = y<br />
x(j) = y;<br />
j = j + 1;<br />
end<br />
end<br />
hist(x);<br />
<br />
<br />
The histogram produced here follows the target density, <math>f(x) = 2x</math>, on the interval <math> 0 \leq x \leq 1 </math>, as expected for a Beta(2,1) distribution.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
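A minimal Python sketch of this discrete example follows (variable and function names are illustrative). It uses the pmf and the uniform proposal given above and checks the sample frequencies against the target probabilities.<br />

```python
import random

pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}   # target f from the example
g = 0.25                                      # discrete uniform proposal on {1, 2, 3, 4}
c = max(p / g for p in pmf.values())          # 0.55 / 0.25 = 2.2


def draw(rng):
    while True:
        y = rng.choice([1, 2, 3, 4])          # Y ~ Unif{1, 2, 3, 4}
        u = rng.random()                      # U ~ Unif[0, 1]
        if u <= pmf[y] / (c * g):             # accept with probability f(Y) / (c g(Y))
            return y


rng = random.Random(2)
n = 20000
counts = {k: 0 for k in pmf}
for _ in range(n):
    counts[draw(rng)] += 1
freqs = {k: counts[k] / n for k in pmf}       # should be close to pmf
```

Note that a candidate <math>Y = 2</math> is always accepted, because the ratio equals 1 at the point where <math>\frac{f(x)}{g(x)}</math> attains its maximum.<br />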
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a sample of the desired size from the target distribution is obtained. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math>, i.e. mean <math> \frac{1}{\lambda} </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
This suggests that if we want to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add the results up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far; e.g., the inverse of F(Z) is not available in closed form. We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If x and y are points on the Cartesian plane, r is the length of the radius from a point in the polar plane to the pole, and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi} \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
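The change of variables behind this result can be sketched as follows (a worked derivation added for illustration, writing <math>x = \sqrt{d}\cos\theta</math> and <math>y = \sqrt{d}\sin\theta</math> with <math>d = r^2</math>):<br />
:<math>\begin{align} \left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| &{}= \left| \frac{\cos\theta}{2\sqrt{d}} \cdot \sqrt{d}\cos\theta + \sqrt{d}\sin\theta \cdot \frac{\sin\theta}{2\sqrt{d}} \right| = \frac{1}{2} \\ f(x,y)\left|\frac{\partial(x,y)}{\partial(d,\theta)}\right| &{}= \frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \cdot \frac{1}{2} = \frac{1}{2}e^{-\frac{d}{2}} \cdot \frac{1}{2\pi} \end{align}</math><br />
The factor <math>\frac{1}{2}e^{-\frac{d}{2}}</math> is the density of an exponential random variable with rate <math>\frac{1}{2}</math>, and <math>\frac{1}{2\pi}</math> is the density of a uniform random variable on <math>[0, 2\pi]</math>.<br />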
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independent} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (the variable u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
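<br />
Both matrices work because the rows of r*ss have covariance <math>ss^T ss</math>, and each choice satisfies <math>ss^T ss = \Sigma</math>. A quick numerical check, added here for illustration with entries rounded to four decimals:<br />
:<math>\begin{bmatrix} 1&0\\ 0.5&0.8660\end{bmatrix}\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix} = \begin{bmatrix} 1&0.5\\ 0.5&0.25+0.8660^2\end{bmatrix} \approx \begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
:<math>\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}^2 = \begin{bmatrix} 0.9659^2+0.2588^2 & 2(0.9659)(0.2588)\\ 2(0.9659)(0.2588) & 0.9659^2+0.2588^2\end{bmatrix} \approx \begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />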
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right]</math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming <math>\sigma</math> is known and the prior on <math>\theta = \mu</math> is Gaussian:<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
:<math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean is a weighted average of the sample mean <math>\bar{x}</math> (the MLE) and the prior mean <math>\mu_0</math>, so it generally differs from the MLE.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
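The two limiting cases can be checked numerically with a small Python sketch. The function name <code>posterior_normal_mean</code> and the numbers used are hypothetical, and the variance returned is the standard result <math>\left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math> for known <math>\sigma</math>.<br />

```python
def posterior_normal_mean(xs, sigma, mu0, tau):
    """Posterior mean and variance of mu for N(mu, sigma^2) data and a N(mu0, tau^2) prior."""
    n = len(xs)
    precision = n / sigma**2 + 1 / tau**2
    xbar = sum(xs) / n if n else 0.0          # xbar gets weight 0 when there is no data
    mean = (n / sigma**2 * xbar + mu0 / tau**2) / precision
    return mean, 1 / precision


prior_only = posterior_normal_mean([], 1.0, 5.0, 2.0)              # N = 0: mean equals mu0
lots_of_data = posterior_normal_mean([2.0] * 10000, 1.0, 5.0, 2.0)  # large N: mean near xbar
two_points = posterior_normal_mean([1.0, 3.0], 1.0, 0.0, 1.0)       # weights 2/3 and 1/3
```

With no data the posterior mean is the prior mean 5, and with 10000 observations averaging 2 it is within about <math>10^{-4}</math> of 2.<br />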
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm z_{\alpha/2} \cdot SE</math><br />
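The process and the two statistics can be sketched together in Python. The function name <code>mc_integrate</code> is illustrative, and the integrand <math>h(x) = x^2</math> on <math>[0, 2]</math> (exact value <math>\frac{8}{3}</math>) is a hypothetical test case chosen here, not one from the lecture.<br />

```python
import math
import random


def mc_integrate(h, a, b, n, rng):
    """Estimate the integral of h over [a, b]; return (estimate, standard error)."""
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]   # Y_i = w(x_i)
    i_hat = sum(ys) / n
    s2 = sum((y - i_hat) ** 2 for y in ys) / (n - 1)           # sample variance of the Y_i
    return i_hat, math.sqrt(s2 / n)


rng = random.Random(3)
i_hat, se = mc_integrate(lambda x: x ** 2, 0.0, 2.0, 100_000, rng)
ci_95 = (i_hat - 1.96 * se, i_hat + 1.96 * se)                 # z_{0.025} is about 1.96
```

The standard error shrinks like <math>\frac{1}{\sqrt{N}}</math>, so halving the width of the interval requires roughly four times as many samples.<br />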
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x > 0</math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math> (for instance with the Inverse Transform Method, <math>x_i = -\log(u_i)</math>)<br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u=rand(100,1);<br />
x=-log(u); % inverse transform: x ~ Exp(1)<br />
h=x.^.5; % sqrt(x), elementwise<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value of the integral computed numerically by Matlab<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# <math> \displaystyle I = \left[ x^4/4 \right]_0^1 = 1/4 </math> (the exact value, for comparison)<br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);<br />
mean(x.^3)<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite interval<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute in <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>; as x goes from 5 to <math>\infty</math>, y goes from 1 to 0<br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
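A short Python sketch of Example 3 follows (variable names are illustrative). It applies the substitution <math>y = \frac{1}{x-4}</math>, which maps <math>[5, \infty)</math> onto <math>(0, 1]</math>, and compares the estimate with the closed form <math>3e^{-5}</math> of the original integral.<br />

```python
import math
import random

rng = random.Random(4)
n = 100_000


def w(y):
    # integrand after substituting y = 1/(x - 4), so x = 1/y + 4 and dx = -dy / y^2
    return 3.0 * math.exp(-(1.0 / y + 4.0)) / y ** 2


total = 0.0
for _ in range(n):
    y = 1.0 - rng.random()        # uniform on (0, 1], avoids division by zero at y = 0
    total += w(y)
i_hat = total / n
exact = 3.0 * math.exp(-5.0)      # closed form of the original integral
```

The transformed integrand is bounded on <math>(0, 1]</math>, so the estimate has a small variance even though the original range of integration is infinite.<br />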
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of the probability 'p' of a Binomial with x successes in n trials is proportional to the likelihood <math>\displaystyle p^x(1-p)^{n-x}</math>, which has the form of a Beta density. Let<br><br />
:<math>\displaystyle f(p_1 | X) \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle f(p_2 | Y) \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for Y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta)<br />
<br />
In one run of this example, the mean was 0.1938; the exact value varies from run to run.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\ \text{CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We assume a flat prior on each <math>\displaystyle p_i</math>, so that its posterior is a Beta distribution. The procedure is:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>; otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta</math> as the first dose <math>\displaystyle x_j</math> whose sampled <math>\displaystyle p_j</math> is at least 0.5.<br />
# Repeat the process N times to get N values <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# Estimate <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{i=1}^{N} \delta_i}{N}</math>.<br />
<br />
For instance, suppose that for each dose level <math>x_i</math>, <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math> etc.<br />
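<br />
The Monte Carlo procedure above can be sketched as follows. This is only an illustration, written in Python rather than Matlab: the dose levels and death counts are hypothetical (the notes give only <math>Y_1</math> and <math>Y_2</math>), the function name <code>estimate_delta</code> is made up, and draws in which no dose reaches a 50% death probability are simply skipped, which is an assumption. The ordering check is done dose by dose so that a draw is abandoned as soon as the ordering fails.<br />

```python
import random

def estimate_delta(doses, deaths, n_rats, n_keep, rng):
    """Monte Carlo estimate of the lowest dose with death probability >= 0.5."""
    kept = []
    while len(kept) < n_keep:
        # Steps 1-2: draw p_i ~ Beta(y_i + 1, n - y_i + 1) one dose at a time
        # and abandon the draw as soon as p_1 <= p_2 <= ... is violated.
        p = []
        for y in deaths:
            p_i = rng.betavariate(y + 1, n_rats - y + 1)
            if p and p_i < p[-1]:
                break
            p.append(p_i)
        if len(p) < len(deaths):
            continue  # ordering violated: discard and try again
        # Step 3: delta is the first dose whose sampled p_i is at least 0.5.
        # Draws in which no dose reaches 0.5 are skipped (an assumption).
        for x, p_i in zip(doses, p):
            if p_i >= 0.5:
                kept.append(x)
                break
    # Steps 4-5: average the retained values of delta.
    return sum(kept) / len(kept)

# Hypothetical data: dose levels 1..10 and deaths out of 10 rats per dose.
doses = list(range(1, 11))
deaths = [0, 1, 2, 3, 5, 6, 7, 8, 9, 10]
delta_hat = estimate_delta(doses, deaths, n_rats=10, n_keep=300,
                           rng=random.Random(0))
print(delta_hat)
```

Because most draws violate the ordering constraint, the rejection step dominates the running time; the early exit keeps that cost modest.<br />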
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As in the case with the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target function. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as the expectation with respect to U, such that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate the integral by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> with <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Draw <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math>.<br />
# Compute <math>\displaystyle w(x_i)=\frac{h(x_i)f(x_i)}{g(x_i)}</math> for each sample.<br />
# Estimate <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
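<br />
As a hedged illustration of the process, the Python sketch below estimates <math>E(X^2)=1</math> for <math>X \sim~ N(0,1)</math> by sampling from a standard Cauchy distribution instead of the normal. The choice of a Cauchy proposal, the function names and the sample size are assumptions made for this example only; the Cauchy density has heavier tails than the normal, so the weights stay bounded.<br />

```python
import math
import random

def normal_pdf(x):
    """Density f of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cauchy_pdf(x):
    """Density g of the standard Cauchy distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))

def sample_cauchy(rng):
    """Inverse-CDF draw from the standard Cauchy distribution."""
    return math.tan(math.pi * (rng.random() - 0.5))

def importance_sampling(h, f, g, sample_g, n, rng):
    """Estimate I = integral of h(x) f(x) dx using n draws from g."""
    total = 0.0
    for _ in range(n):
        x = sample_g(rng)             # x_i ~ g
        total += h(x) * f(x) / g(x)   # w(x_i) = h(x_i) f(x_i) / g(x_i)
    return total / n

I_hat = importance_sampling(lambda x: x * x, normal_pdf, cauchy_pdf,
                            sample_cauchy, 100_000, random.Random(1))
print(I_hat)  # should be close to E[X^2] = 1
```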
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> that are more probable under the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same as applying a regular Monte Carlo simulation to <math>\displaystyle\int h(x)g(x)\,dx </math> and weighting each sampled value by <math>\displaystyle \frac{f(x)}{g(x)}</math>, so that the values for which this ratio is high count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of the <math> h(x_i) </math> instead of a plain average.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f > g</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> faster than <math>\displaystyle h^2(x)f^2(x) </math>, then the integrand blows up and <math>\displaystyle E[(w(x))^2] </math> may be infinite. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>: then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> could be infinitely large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x))^2] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
::::::::::======Aside: Jensen's Inequality======<br />
::<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math> for all <math>\displaystyle 0 \leq \alpha \leq 1</math>. Here <math>\displaystyle g</math> denotes a generic convex function, not the sampling density.<br /><br />
Essentially, the definition of convexity says that the line segment between two points on the curve lies above the curve. This generalizes to any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
::::::::::=======================================================<br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
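Therefore, applying this with the convex function <math>\displaystyle u \mapsto u^2</math> to <math>\displaystyle \left| w(x) \right|</math>,<br />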
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
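Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math>, then <math>\displaystyle \left| w(x) \right| = \frac{\left| h(x) \right| f(x)}{g^*(x)} = \int \left| h(s) \right| f(s) ds</math> is constant, so <math>\displaystyle E_{g^*}[(w(x))^2] = \left(\int \left| h(s) \right| f(s) ds \right)^2</math> and the lower bound is attained.<br /><br />
<br />
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />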
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
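<br />
A small worked case can make the theorem concrete. The Python sketch below assumes a toy problem that is not from the lecture: <math>f</math> is the Exp(1) density and <math>h(x)=x</math>, so <math>I=E(X)=1</math> and <math>g^*(x)=xe^{-x}</math> is the Gamma(2,1) density. Every weight then equals <math>I</math> and the sample variance of the weights is zero. It also shows why the result is mainly theoretical: writing down <math>g^*</math> already required knowing <math>I</math>.<br />

```python
import math
import random
import statistics

# Assumed toy problem: f is the Exp(1) density and h(x) = x, so I = E[X] = 1.
def f(x):
    return math.exp(-x)

def h(x):
    return x

# g*(x) = |h(x)| f(x) / I = x exp(-x), the Gamma(2, 1) density. The divisor
# 1.0 is the normalizing constant I, i.e. the quantity being estimated.
def g_star(x):
    return x * math.exp(-x) / 1.0

rng = random.Random(2)
xs = [rng.gammavariate(2.0, 1.0) for _ in range(1000)]   # x_i ~ g*
weights = [h(x) * f(x) / g_star(x) for x in xs]          # w(x_i)

I_hat = statistics.fmean(weights)
var_w = statistics.pvariance(weights)
print(I_hat, var_w)
```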
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_1,...,x_N \sim~ g</math><br />
Since <math>\displaystyle \int f(x) dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant, because any constant factor in f cancels in <math>\displaystyle b'</math>. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
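<br />
The normalized-weight form can be sketched in Python as below. The example is an assumption chosen for illustration: <math>f</math> is a standard normal density with its constant dropped, the proposal is <math>N(0,2^2)</math>, and <math>h(x)=x^2</math>, so the true value is 1. The function name <code>self_normalized_is</code> is made up for this sketch.<br />

```python
import math
import random

def self_normalized_is(h, f_unnorm, g, sample_g, n, rng):
    """Estimate E_f[h(X)] when f is known only up to a constant."""
    num = 0.0
    den = 0.0
    for _ in range(n):
        x = sample_g(rng)          # x_i ~ g
        b = f_unnorm(x) / g(x)     # unnormalized weight b(x_i)
        num += b * h(x)
        den += b
    return num / den               # equals the sum of b'(x_i) h(x_i)

# Assumed example: f is proportional to exp(-x^2 / 2), i.e. a standard
# normal density without its constant; g is the N(0, 2^2) density.
def f_unnorm(x):
    return math.exp(-0.5 * x * x)

def g(x):
    return math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

I_hat = self_normalized_is(lambda x: x * x, f_unnorm, g,
                           lambda rng: rng.gauss(0.0, 2.0),
                           100_000, random.Random(3))
print(I_hat)  # should be close to 1
```

Multiplying <code>f_unnorm</code> by any positive constant leaves the estimate unchanged, which is the point of normalizing the weights.<br />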
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the normal distribution and to count the proportion of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is considered a low probability event (a tail event), and different runs may give significantly different values. For example: if we generate 1000 samples, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we repeat the estimate 100 times, generating 100 samples each time):<br />
<br />
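format long<br />
a = zeros(1,100); % one estimate per replication<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100; % fraction of 100 N(0,1) draws that are at least 3<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />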
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x)</math> be the density of <math>N(4,1) </math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
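N = 100; % samples per replication<br />
I = zeros(1,100); % one estimate per replication<br />
for j = 1:100<br />
x = randn(N,1) + 4; % draw from g = N(4,1)<br />
h = x>=3; % indicator of the event x >= 3<br />
b = exp(8-4*x); % weight f(x)/g(x)<br />
I(j) = sum(h.*b)/N; % importance sampling estimate<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />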
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math>, which is very close to 0 and much less than the variance observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...,X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space Set:'''<math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Indexed Set:'''<math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Indexed Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the whole past, <math>\displaystyle X_t</math> depends only on <math>\displaystyle X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle P_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>. And the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, then we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\ddots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
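<br />
As a rough Python sketch, the transition matrix of this stopped random walk can be built directly from the description in the text (probability <math>p</math> of a step to the right, <math>q=1-p</math> to the left, and the first and last states absorbing). The function name, the number of states and the value of <math>p</math> are assumptions for illustration.<br />

```python
def random_walk_matrix(n_states, p):
    """Transition matrix of a random walk on states 0..n_states-1 that stops
    at either end; p is the probability of a step to the right."""
    q = 1.0 - p
    P = [[0.0] * n_states for _ in range(n_states)]
    P[0][0] = 1.0                          # first state is absorbing
    P[n_states - 1][n_states - 1] = 1.0    # last state is absorbing
    for i in range(1, n_states - 1):
        P[i][i - 1] = q                    # step to the left
        P[i][i + 1] = p                    # step to the right
    return P

P = random_walk_matrix(5, 0.25)
for row in P:
    print(row)
```

Each row of the result is a probability mass function over the next state.<br />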
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a Monte Carlo average can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math>, summed over all <math>j</math>, add up to 1: row <math>i</math> must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
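From this it follows that the n-step transition matrix is <math>\frac{}{}P(n)=P^n</math>: taking <math>\mu_0</math> to be the point mass at state <math>i</math> shows that row <math>i</math> of <math>P^n</math> is the distribution of <math>X_n</math> given <math>X_0=i</math>.<br />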
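<br />
The theorem can be checked numerically. The Python sketch below uses a hypothetical two-state chain and starting distribution (both assumptions, not from the lecture) and compares the step-by-step recursion <math>\mu_n=\mu_{n-1}P</math> with the closed form <math>\mu_0P^n</math>.<br />

```python
def vec_mat(mu, P):
    """Row vector times matrix: returns mu P."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu)))
            for j in range(len(P[0]))]

def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(P, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

# Hypothetical two-state chain and starting distribution.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu0 = [1.0, 0.0]

# Step-by-step recursion mu_n = mu_{n-1} P for n = 1..5.
mu = mu0
for _ in range(5):
    mu = vec_mat(mu, P)

# Closed form mu_0 P^5 for comparison.
mu_direct = vec_mat(mu0, mat_pow(P, 5))
print(mu, mu_direct)
```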
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that from states 1 and 2 the only nonzero transition probabilities are <math>\displaystyle P_{11}</math>, <math>\displaystyle P_{12}</math>, <math>\displaystyle P_{21}</math> and <math>\displaystyle P_{22}</math>, so the chain never leaves this class<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> The state 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
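<br />
The decomposition above can be reproduced mechanically. The following Python sketch is one possible way to do it, not the method used in the lecture: it finds, for each state, the set of states reachable from it and groups states that reach each other. States are indexed from 0 in the code and shifted by one when printed; the helper names are made up.<br />

```python
def reachable(P, i):
    """Set of states j that can be reached from i (including i itself)."""
    seen = {i}
    stack = [i]
    while stack:
        s = stack.pop()
        for j, prob in enumerate(P[s]):
            if prob > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def communicating_classes(P):
    """Partition states 0..N-1 into classes of mutually reachable states."""
    reach = [reachable(P, i) for i in range(len(P))]
    classes = []
    assigned = set()
    for i in range(len(P)):
        if i in assigned:
            continue
        cls = sorted(j for j in reach[i] if i in reach[j])
        classes.append(cls)
        assigned.update(cls)
    return classes

def is_closed(P, cls):
    """A class is closed if no transition leaves it."""
    return all(P[i][j] == 0
               for i in cls for j in range(len(P)) if j not in cls)

P = [[1/3, 2/3, 0,   0],
     [2/3, 1/3, 0,   0],
     [1/4, 1/4, 1/4, 1/4],
     [0,   0,   0,   1]]
classes = communicating_classes(P)
print([[s + 1 for s in cls] for cls in classes])  # [[1, 2], [3], [4]]
print([is_closed(P, cls) for cls in classes])     # [True, False, True]
```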
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability of coming back to state i, starting from state i, is 1.<br />
<br />
If this is not the case (i.e. the probability is less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. A return to <math>\displaystyle 0</math> is possible only after an even number of steps 2n, consisting of n steps to the right and n steps to the left in any order:<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim~n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim~\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle P_{00}(n)=0</math> for odd n, the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Because <math>\displaystyle 4pq \leq 1</math> with equality only when <math>\displaystyle p = \frac{1}{2}</math>, we can conclude that if <math>\displaystyle 4pq < 1</math> the terms decay geometrically and the state is transient; otherwise the terms decay only like <math>\displaystyle \frac{1}{\sqrt{n}}</math> and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n) \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
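<br />
The two regimes can also be seen numerically. The Python sketch below (an illustration only, with a made-up function name and arbitrarily chosen values of <math>p</math>) adds up the terms <math>\binom{2n}{n}(pq)^n</math> directly. For <math>p=0.6</math> the partial sums settle at <math>\frac{1}{\sqrt{1-4pq}}</math>, while for <math>p=0.5</math> they keep growing as more terms are added.<br />

```python
import math

def return_prob_sum(p, n_terms):
    """Partial sum of P_00(2n) = C(2n, n) (pq)^n for n = 0..n_terms-1."""
    pq = p * (1.0 - p)
    term = 1.0      # the n = 0 term
    total = 0.0
    for n in range(n_terms):
        total += term
        # C(2n+2, n+1) / C(2n, n) = 2 (2n + 1) / (n + 1)
        term *= 2.0 * (2 * n + 1) / (n + 1) * pq
    return total

# Biased walk: the series converges to 1 / sqrt(1 - 4pq).
biased = return_prob_sum(0.6, 2000)
closed_form = 1.0 / math.sqrt(1.0 - 4.0 * 0.6 * 0.4)
print(biased, closed_form)

# Symmetric walk: the partial sums grow without bound.
for n_terms in (100, 10_000, 1_000_000):
    print(n_terms, return_prob_sum(0.5, n_terms))
```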
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
min\{n \geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists, given } x_{0}=i \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle d</math> if, starting from a state, a return to that state is possible only after a multiple of <math>\displaystyle d</math> steps, where <math>\displaystyle d > 1</math> is the largest number with this property.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It is evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition. Starting from state 1, the chain can return to state 1 only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
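<br />
The period can be checked numerically by looking at the n-step return probability to state 1, which is the (1,1) entry of <math>P^n</math>. The following is a minimal MATLAB sketch; the variable names are our own.<br />

```matlab
% n-step return probabilities to state 1 for the four-state chain above
P = 0.5 * [0 1 0 1; 1 0 1 0; 0 1 0 1; 1 0 1 0];
ret = zeros(1, 6);
for n = 1:6
    Pn = P^n;
    ret(n) = Pn(1,1);        % Pr(x_n = 1 | x_0 = 1)
end
disp(ret);
```

The return probability is 0 for every odd n and 0.5 for every even n, which is consistent with period 2.<br />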
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf).<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math>, as required.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
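<br />
A short MATLAB check makes the warning concrete: the uniform vector is stationary for this chain, but the powers of P cycle instead of settling down. This is only a sketch; the variable names are our own.<br />

```matlab
P = [0 1 0; 0 0 1; 1 0 0];
piStat = [1/3 1/3 1/3];
disp(norm(piStat*P - piStat));   % 0: piStat satisfies pi = pi*P
disp(P^3);                       % identity matrix: every state returns after 3 steps
disp(P^4);                       % equals P again, so P^n cycles and has no limit
```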
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \ 0 \ 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N \to \infty} \frac{1}{N}\displaystyle\sum_{n=1}^N g(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
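<br />
The hand calculation can be cross-checked numerically. The MATLAB sketch below solves <math>\pi = \pi P</math> together with the normalization constraint as a linear system, and also raises P to a high power; the power 50 is an arbitrary choice.<br />

```matlab
P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];
% Stack the equations pi*(P - I) = 0 and sum(pi) = 1 into one
% (consistent, overdetermined) system and solve it with backslash.
A = [(P - eye(3))'; ones(1,3)];
rhs = [zeros(3,1); 1];
piStat = (A \ rhs)';
disp(piStat);                % compare with [4/11 4/11 3/11]
disp(P^50);                  % every row should be close to piStat
```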
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>; this is the starting point of the chain. Choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br />2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=\min{\{\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br />4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>\displaystyle x</math>. The starting point of the algorithm, <math>\displaystyle X_0</math>, is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=\min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of Metropolis-Hastings.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The probability of setting the mean to x and sampling y is equal to the probability of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle y</math> usually lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=\min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>; the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: the shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing B too large (B=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing B too small (B=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b helps avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for B (B=2)<br />
<br />
[[File:Bgood.JPG]]<br />
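<br />
The effect of b can also be seen in the acceptance rate. The MATLAB sketch below reruns the Cauchy example for the three values of b used in the plots above and records the fraction of proposals that are accepted. The chain length matches the earlier code; the variable names (bs, rate, nAccept) are our own.<br />

```matlab
bs = [0.001 2 1000];             % too small, reasonable, too large
N = 10000;
rate = zeros(size(bs));
for k = 1:numel(bs)
    b = bs(k);
    x = randn;                   % random starting point
    nAccept = 0;
    for i = 2:N
        y = b*randn + x;                      % proposal from N(x, b^2)
        r = min((1 + x^2)/(1 + y^2), 1);      % Cauchy ratio f(y)/f(x)
        if rand < r
            x = y;
            nAccept = nAccept + 1;
        end
    end
    rate(k) = nAccept/(N - 1);
    fprintf('b = %g: acceptance rate = %.3f\n', b, rate(k));
end
```

A very small b should give an acceptance rate near 1 (tiny moves are almost always accepted, but the chain barely moves), and a very large b should give a rate near 0 (the chain is stuck most of the time).<br />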
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br /> We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies detailed balance then it is a stationary distribution, <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance condition as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a stationary distribution if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution. Starting from the right-hand side of the stationarity condition and applying detailed balance:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> (by detailed balance)<br />
:::<math>=f(x)\int P(x,y)dy</math><br />
:::<math>=\displaystyle f(x)</math>, since <math>\int P(x,y)dy=1</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})\cdot r(x,y)=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math>. Cancelling out <math>\displaystyle q(y\mid{x})</math> and multiplying both sides by <math>\displaystyle f(x)</math> we get<br />
<br /><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
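<br />
The same argument can be checked numerically on a small discrete state space. The MATLAB sketch below builds the Metropolis-Hastings transition matrix for a hypothetical 4-state target and a randomly generated proposal matrix, then tests detailed balance and stationarity. The target f and the proposal Q are assumptions made only for illustration.<br />

```matlab
f = [0.1 0.2 0.3 0.4];                   % hypothetical target pmf on 4 states
K = numel(f);
Q = rand(K);
Q = Q ./ repmat(sum(Q, 2), 1, K);        % arbitrary proposal, rows sum to 1
P = zeros(K);
for i = 1:K
    for j = 1:K
        if i ~= j
            r = min(f(j)*Q(j,i) / (f(i)*Q(i,j)), 1);   % acceptance probability
            P(i,j) = Q(i,j) * r;                        % propose j, then accept
        end
    end
    P(i,i) = 1 - sum(P(i,:));            % rejected proposals stay at i
end
F = diag(f) * P;                         % F(i,j) = f(i) * P(i,j)
disp(max(max(abs(F - F'))));             % ~0: detailed balance holds
disp(max(abs(f*P - f)));                 % ~0: f is stationary for P
```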
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (that is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is more efficient, although it is less widely applicable.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math> including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br />2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is taken to be a function of the distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br />3. Accept <math>\displaystyle Y</math> with probability <math>r(x,y)=\min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
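<br />
As a worked illustration, take the Cauchy target from the previous lecture, for which <math>\frac{f(y)}{f(x)}=\frac{1+x^2}{1+y^2}</math>, with current state <math>\displaystyle x = 1</math>. The proposed values <math>\displaystyle y_1 = 0</math> and <math>\displaystyle y_2 = 3</math> are assumptions chosen for convenience; they are not read off the figure.<br />
<br />
:<math>\frac{f(y_1)}{f(x)} = \frac{1+1^2}{1+0^2} = 2 > 1 \Rightarrow r(x,y_1) = 1</math><br />
:<math>\frac{f(y_2)}{f(x)} = \frac{1+1^2}{1+3^2} = 0.2 < 1 \Rightarrow r(x,y_2) = 0.2</math><br />
<br />
So the move towards the mode is always accepted, while the move into the tail is accepted only 20% of the time.<br />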
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. (It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known, but the distribution is not exactly known.)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
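<br />
The dependence can be seen in a small experiment. The MATLAB sketch below runs an independence sampler and estimates the lag-1 correlation of the resulting chain. The target (a standard normal known only up to a constant) and the proposal (a normal with standard deviation 2) are assumptions chosen for illustration.<br />

```matlab
f = @(x) exp(-x.^2/2);               % target density up to a constant (assumption)
s = 2;                               % proposal standard deviation (assumption)
q = @(x) exp(-x.^2/(2*s^2));         % proposal density up to a constant
N = 20000;
X = zeros(N, 1);
for i = 2:N
    Y = s*randn;                     % the proposal ignores the current state
    r = min(f(Y)/f(X(i-1)) * q(X(i-1))/q(Y), 1);
    if rand < r
        X(i) = Y;                    % accept
    else
        X(i) = X(i-1);               % reject: repeat the current state
    end
end
c = corrcoef(X(1:end-1), X(2:end));
fprintf('mean = %.3f, var = %.3f, lag-1 correlation = %.3f\n', ...
        mean(X), var(X), c(1,2));
```

The normalizing constants of f and q cancel in the ratio, so they can be left out. The lag-1 correlation should come out clearly positive, because every rejection repeats the previous state.<br />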
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
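Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But, this is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for some constant <math>\displaystyle T > 0</math> (since the exponential function is monotonically increasing, minimizing <math>\displaystyle h(x)</math> is equivalent to maximizing <math>\displaystyle e^{\frac{-h(x)}{T}}</math>)<br />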
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, set <math>\displaystyle i = i + 1</math> and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math> (but close to 1 when T is large, so moves to worse points are still accepted often)<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = 0 </math> and we therefore reject <math>\displaystyle y </math><br />
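<br />
To see how quickly the acceptance probability for a move to a worse point shrinks as T decreases, the MATLAB sketch below evaluates <math>r = e^{\frac{h(x)-h(y)}{T}}</math> for a fixed increase <math>h(y)-h(x)=1</math>. Both this increase and the list of temperatures are illustrative assumptions.<br />

```matlab
dh = 1;                          % h(y) - h(x) > 0: a move to a worse point
T = [100 10 1 0.1 0.01];         % decreasing temperatures
r = min(exp(-dh ./ T), 1);       % acceptance probability at each temperature
disp([T; r]);                    % first row: T, second row: r
```

The acceptance probability is about 0.99 at T = 100 and is essentially 0 at T = 0.01, matching the two limiting cases described above.<br />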
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it makes a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, we get T to be pretty small, our distribution that we're sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
obj = @(x) (x - 2).^2; %objective function, written as an anonymous function so the script runs as a single file<br />
<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2; %standard deviation of the normal proposal<br />
X(1) = 0; %starting point of the chain<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); %propose Y from N(X(i-1), b^2)<br />
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
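<br />
The manual procedure of rerunning the loop for each new T can be wrapped in an outer loop over the temperatures. The MATLAB sketch below does this for the schedule listed above; using 100 iterations per temperature and storing the whole path in one vector are our own assumptions about how the runs were combined.<br />

```matlab
obj = @(x) (x - 2).^2;                         % function to minimize
Ts = [10 9 5 2 0.9 0.1 0.01 0.005 0.001];      % cooling schedule from the text
b = 2;                                         % proposal standard deviation
nPerT = 100;                                   % iterations per temperature (assumption)
X = zeros(1, nPerT*numel(Ts));
x = 0;                                         % starting point
k = 0;
for T = Ts
    for i = 1:nPerT
        Y = b*randn + x;
        r = min(1, exp((obj(x) - obj(Y))/T));  % equals min(1, f(Y)/f(x))
        if rand < r
            x = Y;                             % accept Y
        end
        k = k + 1;
        X(k) = x;                              % record the current state
    end
end
fprintf('final state = %.4f\n', X(end));
% plot(X) shows the path settling near the minimizer x = 2
```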
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, as Metropolis-Hastings does:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The procedure for proving this balance equation is similar to what was done in the Metropolis-Hastings proof.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively large correlations between the random variables, because each conditional update can then move only a short distance.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution will be sampled better using only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; % mean vector mu<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2); % conditional standard deviation<br />
X(1,:) = [0 0]; % initial point<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); % conditional mean of x1 given the previous x2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); % conditional mean of x2 given the NEW x1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots roughly the last 4000 points -> <br />
%this demonstrates that convergence occurs after a while <br />
%(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
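<br />
With <math>\displaystyle \rho = 0.01</math> the two coordinates are almost independent, so it is hard to see from the output whether the sampler reproduces the correlation. The MATLAB sketch below repeats the experiment with <math>\displaystyle \rho = 0.9</math>, discards a burn-in period, and compares the sample correlation with <math>\displaystyle \rho</math>. The value of rho, the burn-in length of 1000 and the variable names are assumptions made for illustration.<br />

```matlab
mu = [1 2];
rho = 0.9;                       % strong correlation, for illustration (assumption)
s = sqrt(1 - rho^2);             % conditional standard deviation
N = 5000; burn = 1000;           % chain length and burn-in (assumptions)
X = zeros(N, 2);                 % start at (0,0)
for i = 2:N
    X(i,1) = mu(1) + rho*(X(i-1,2) - mu(2)) + s*randn;   % x1 | previous x2
    X(i,2) = mu(2) + rho*(X(i,1)   - mu(1)) + s*randn;   % x2 | new x1
end
Z = X(burn+1:end, :);            % keep only the draws after burn-in
c = corrcoef(Z(:,1), Z(:,2));
fprintf('sample mean = [%.2f %.2f], sample correlation = %.2f\n', ...
        mean(Z), c(1,2));
```

The sample mean should be close to (1, 2) and the sample correlation close to 0.9.<br />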
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some random initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
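<br />
The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the target f is taken to be an unnormalized standard bivariate normal density with correlation 0.5, and both proposals are symmetric Gaussian random walks, so the ratio of proposal densities in r equals 1. The names f, mh_within_gibbs, step and seed are hypothetical and do not come from the lecture.<br />

```python
import math
import random

def f(x, y, rho=0.5):
    """Unnormalized standard bivariate normal density with correlation rho (assumed target)."""
    return math.exp(-(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho * rho)))

def mh_within_gibbs(n_steps, step=1.0, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0                      # step 1: starting values X_0, Y_0
    chain = [(x, y)]
    for _ in range(n_steps):
        # Metropolis-Hastings step for X, treating Y as fixed.
        # The random-walk proposal is symmetric, so q(X_n|Z)/q(Z|X_n) = 1.
        z = rng.gauss(x, step)
        r = min(1.0, f(z, y) / f(x, y))
        if rng.random() < r:
            x = z
        # Metropolis-Hastings step for Y, treating the new X as fixed.
        z = rng.gauss(y, step)
        r = min(1.0, f(x, z) / f(x, y))
        if rng.random() < r:
            y = z
        chain.append((x, y))
    return chain

chain = mh_within_gibbs(20000)
kept = chain[2000:]                      # discard an initial burn-in period
mean_x = sum(p[0] for p in kept) / len(kept)
mean_y = sum(p[1] for p in kept) / len(kept)
# Both means are 0 and both variances are 1, so E[XY] estimates the correlation
mean_xy = sum(p[0] * p[1] for p in kept) / len(kept)
```

With a non-symmetric proposal, the ratio of proposal densities would have to be kept in the formula for r, exactly as written in steps 3 and 6.<br />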
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
PageRank is a link analysis algorithm named after Larry Page, a computer scientist and one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However, the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many important pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero (as long as <math>\displaystyle d < 1</math>). We weight the sum and the constant using <math>\displaystyle d </math> (which is just a coefficient between 0 and 1 used to balance the two terms).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
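Since rank is a relative term, we are free to fix its overall scale. If we make the assumption that the average rank is 1, i.e.<br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
then we can solve for P (in matrix form this is <math>\displaystyle e^T \times P = N</math>, so that <math>\displaystyle 1 = \frac{1}{N} \times e^T \times P</math>):<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = \frac{1-d}{N}\times e \times e^T \times P + d \times L \times D^{-1} \times P \text{ by replacing 1 with } \frac{1}{N} \times e^T \times P </math> <br />
<br />
<math>\displaystyle P = \left[\frac{1-d}{N} \times e \times e^T + d \times L \times D^{-1}\right] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
With this normalization every column of A sums to <math>\displaystyle (1-d) + d = 1</math> (assuming every page has at least one outgoing link), which is what allows A to be read as a transition matrix.<br />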
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector of A corresponding to eigenvalue 1. Secondly, we can recognize that P is proportional to the stationary distribution of a Markov chain with transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the iterates converge to the stationary vector, for large n we have <math>\displaystyle P_n \approx P</math>.<br />
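<br />
A minimal Python sketch of this iteration is given below. It iterates the defining equation <math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P</math> directly, starting from all ranks equal to 1. The function name pagerank, the tolerance, the value d = 0.85 and the four-page link structure are all assumptions made for illustration; the sketch also assumes every page has at least one outgoing link, so that no <math>\displaystyle c_j</math> is zero.<br />

```python
def pagerank(L, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate P <- (1 - d) * e + d * L * D^{-1} * P until the ranks stop changing.

    L[i][j] is 1 if page j links to page i, and 0 otherwise.
    Assumes every page has at least one outgoing link (c_j > 0).
    """
    n = len(L)
    c = [sum(L[i][j] for i in range(n)) for j in range(n)]  # outgoing links per page
    p = [1.0] * n                                           # initial guess P_0
    for _ in range(max_iter):
        new_p = [(1 - d) + d * sum(L[i][j] * p[j] / c[j] for j in range(n))
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical web of four pages:
# page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0, page 3 links to 2.
L = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
ranks = pagerank(L)
```

In this toy example page 2 receives the most links and ends up with the highest rank, while page 3, which nobody links to, keeps only the constant term 1 - d.<br />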
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
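<br />
The definitions above translate almost word for word into code. The following Python sketch uses plain lists as vectors; the helper names dot, norm, dist and angle, as well as the two example vectors, are illustrative choices rather than part of the lecture.<br />

```python
import math

def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Length of v, the square root of v . v."""
    return math.sqrt(dot(v, v))

def dist(u, v):
    """Distance between u and v, the length of u - v."""
    return norm([a - b for a, b in zip(u, v)])

def angle(u, v):
    """Angle between non-zero vectors, from u . v = |u| |v| cos(theta)."""
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp to guard against rounding

u = [3.0, 4.0]
v = [4.0, -3.0]
```

For these two vectors the inner product is 0, both lengths are 5, and the angle between them is a right angle.<br />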
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be computed as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity).<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Alternatively, a matrix is non-singular if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A matrix is said to be ''non-singular'' if and only if its rank equals the number of rows or columns. A non-singular matrix has a non-zero determinant.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist (because A is either not square or singular). When the columns of A are linearly independent, so that <math>A^TA</math> is invertible, it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math> with <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> (the corresponding ''eigenvalue'') such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular, all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite, all eigenvalues are positive.<br />
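<br />
For a 2 by 2 symmetric matrix the characteristic equation is a quadratic, so the eigenvalues can be written down in closed form. The Python sketch below does this and then checks the properties listed above on one small example; the function name eig_sym_2x2 and the example matrix are assumptions made purely for illustration.<br />

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenpairs of the symmetric matrix [[a, b], [b, c]], largest eigenvalue first."""
    # Characteristic equation: lambda^2 - (a + c) * lambda + (a * c - b * b) = 0
    gap = math.sqrt((a - c) ** 2 + 4 * b * b)   # square root of the discriminant
    lams = [(a + c + gap) / 2, (a + c - gap) / 2]
    if b == 0:
        # Already diagonal: the coordinate axes are eigenvectors
        vecs = [(1.0, 0.0), (0.0, 1.0)] if a >= c else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vecs = []
        for lam in lams:
            v = (b, lam - a)                    # a non-zero solution of (A - lam I) v = 0
            size = math.hypot(v[0], v[1])
            vecs.append((v[0] / size, v[1] / size))
    return lams, vecs

# Example matrix A = [[2, 1], [1, 2]], which is symmetric and positive-definite
lams, vecs = eig_sym_2x2(2.0, 1.0, 2.0)
trace = 2.0 + 2.0
overlap = vecs[0][0] * vecs[1][0] + vecs[0][1] * vecs[1][1]
```

Here both eigenvalues are real and positive, they add up to the trace, and the two eigenvectors are orthogonal, in line with the properties above.<br />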
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by an <math>M \times N</math> matrix A.<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y in <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
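<br />
As a small illustration of these properties, the Python sketch below uses a 2 by 2 rotation matrix, which is one example of an orthonormal transformation. The angle 0.7 and the vector x are arbitrary choices, and the helper names rotation and matvec are hypothetical.<br />

```python
import math

def rotation(theta):
    """2x2 rotation matrix, an example of an orthonormal transformation."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

def matvec(A, x):
    """Matrix-vector product y = A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = rotation(0.7)
x = [3.0, 4.0]
y = matvec(A, x)

norm_x = math.hypot(x[0], x[1])
norm_y = math.hypot(y[0], y[1])

# The rows of A should have unit length and be mutually orthogonal
row_lengths = [math.hypot(row[0], row[1]) for row in A]
row_overlap = A[0][0] * A[1][0] + A[0][1] * A[1][1]
```

The vector is rotated but its length is unchanged, and the rows of A form an orthonormal set.<br />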
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances along the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, so that <br />
<br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> obtained with any weight vector <math>\displaystyle w </math>.<br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint. The problem then becomes,<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| \textbf{w} \| \| s \| \| \textbf{w} \| = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to understand which one is the maximum, we just need to substitute each of them into <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
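<br />
The result of this example can be checked numerically without any calculus. The Python sketch below simply walks around the unit circle on a fine grid of angles and keeps the point with the largest value of f; it then verifies that the gradients <math>\displaystyle \nabla f = (1,-1)</math> and <math>\displaystyle \nabla g = (2x,2y)</math> are parallel at that point. The grid size and the variable names are arbitrary choices made for illustration.<br />

```python
import math

def f(x, y):
    return x - y

# Walk around the constraint x^2 + y^2 = 1 on a fine grid of angles
steps = 100000
points = [(math.cos(2 * math.pi * k / steps), math.sin(2 * math.pi * k / steps))
          for k in range(steps)]
x_best, y_best = max(points, key=lambda p: f(p[0], p[1]))

# Gradient condition grad f = lam * grad g, i.e. (1, -1) = lam * (2x, 2y):
# solve the first component for lam, then check the second component
lam = 1 / (2 * x_best)
residual = -1 - lam * 2 * y_best
```

The brute-force search lands on the same point as the Lagrange multiplier method, and the second component of the gradient condition is satisfied up to rounding error.<br />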
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} = 1 </math> and the second part of the equation is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second part of the equation is non-zero and changes the value of <math>\displaystyle L(\textbf{w},\lambda)</math>. Setting the derivative of <math>\displaystyle L</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
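<br />
The following Python sketch checks these statements on a tiny two-dimensional sample (D = 2). The six data points are made up for illustration, the covariance uses the usual n - 1 divisor, and the eigenvalues are obtained from the closed-form solution of the 2 by 2 characteristic equation; none of the variable names come from the lecture.<br />

```python
import math

# Illustrative, made-up 2-D sample (D = 2)
data = [(2.0, 1.6), (0.5, 0.9), (3.1, 2.8), (1.2, 1.0), (2.6, 2.9), (1.8, 1.5)]
n = len(data)

# Centre the data and form the sample covariance S = [[sxx, sxy], [sxy, syy]]
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Eigenvalues of the symmetric 2x2 matrix S, largest first
gap = math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)
lam1 = (sxx + syy + gap) / 2
lam2 = (sxx + syy - gap) / 2

# Unit eigenvector for lam1 (valid here because sxy is not zero)
w = (sxy, lam1 - sxx)
length = math.hypot(w[0], w[1])
w = (w[0] / length, w[1] / length)

# Variance of the projections u = w^T x onto the first principal component
u = [w[0] * x + w[1] * y for x, y in centered]
var_u1 = sum(t * t for t in u) / (n - 1)
```

The variance of the projected data equals the largest eigenvalue, and the two eigenvalues add up to the trace of S, i.e. to the sum of the variances of the original coordinates.<br />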
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster the data (i.e. do not label the data and look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
The MATLAB code is as follows.<br />
 load 2_3 %the size of this file is 64 X 400<br />
 [coefs , scores ] = princomp (X'); <br />
 % performs principal components analysis on the data matrix X<br />
 % returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
 % (newer MATLAB releases provide pca in place of princomp)<br />
 plot(scores(:,1),scores(:,2),'.') <br />
 % plots the first most variant dimension on the x-axis <br />
 % and the second highest on the y-axis <br />
 plot(scores(1:200,1),scores(1:200,2),'.')<br />
 % same graph as above, only with the 2s (not 3s)<br />
 hold on % this command allows us to add to the current plot<br />
 plot (scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
 % the 'ro' option makes them red Os on the plot<br />
 % If we classify based on the position in this plot (feature), <br />
 % it is easier than looking at each of the 64 data pieces<br />
 gname() % displays a figure window and <br />
 % waits for you to press a mouse button or a keyboard key<br />
 figure<br />
 subplot(1,2,1)<br />
 imagesc(reshape(X(:,45),8,8)')<br />
 subplot(1,2,2)<br />
 imagesc(reshape(X(:,93),8,8)')<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, while the data within each class are close together (i.e. each class collapses to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2</math> are the covariance matrices of the two classes, and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> : within classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br />
Let <math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> : between classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
(1) <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
(2) <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br />
<math>\displaystyle [(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) ] / [(w^T(\Sigma_1 + \Sigma_2)w)] </math><br />
<br />
or<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)/(w^Ts_ww)</math><br />
<br />
<br />
<br />
This is a very famous problem, the "generalized eigenvector problem". We can solve it using Lagrange multipliers. Since w is a direction vector, we do not care about its magnitude: rescaling w leaves the ratio unchanged. Therefore we can fix the denominator to 1 and solve the following constrained optimization problem<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
<br />Subject To: <br />
<math>\displaystyle (w^T(\Sigma_1 + \Sigma_2)w)=1 </math> or <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
Therefore, the function that we want to maximize is<br />
<br />
<math>\displaystyle (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) - \lambda * [(w^T(\Sigma_1 + \Sigma_2)w)-1] </math><br />
<br />
or <br />
<br />
<math>\displaystyle (w^Ts_Bw) - \lambda * [(w^Ts_ww)-1] </math><br />
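Setting the derivative of this Lagrangian with respect to <math>w</math> to zero gives <math>s_Bw = \lambda s_ww</math>, and in the two-class case the maximizing direction is proportional to <math>s_w^{-1}(\mu_1 - \mu_2)</math>. The following is a minimal sketch of that computation for two-dimensional data, written in Python rather than the MATLAB used elsewhere in these notes. The function names and the toy data are illustrative assumptions, not part of the lecture.<br />

```python
import math

def class_stats(points):
    """Return the mean and 2x2 covariance (dividing by n) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    return (mx, my), ((sxx, sxy), (sxy, syy))

def scatter_matrices(class1, class2):
    """Return (mu1 - mu2) and s_w = Sigma_1 + Sigma_2."""
    mu1, c1 = class_stats(class1)
    mu2, c2 = class_stats(class2)
    diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])
    s_w = ((c1[0][0] + c2[0][0], c1[0][1] + c2[0][1]),
           (c1[1][0] + c2[1][0], c1[1][1] + c2[1][1]))
    return diff, s_w

def fisher_ratio(w, diff, s_w):
    """(w^T s_B w) / (w^T s_w w), where s_B = diff diff^T."""
    between = (w[0] * diff[0] + w[1] * diff[1]) ** 2
    within = (w[0] * (s_w[0][0] * w[0] + s_w[0][1] * w[1])
              + w[1] * (s_w[1][0] * w[0] + s_w[1][1] * w[1]))
    return between / within

def fisher_direction(diff, s_w):
    """Unit vector proportional to s_w^{-1} (mu1 - mu2)."""
    (a, b), (_, d) = s_w
    det = a * d - b * b
    if det == 0:
        raise ValueError("s_w is singular")
    # The inverse of [[a, b], [b, d]] is (1 / det) * [[d, -b], [-b, a]].
    wx = (d * diff[0] - b * diff[1]) / det
    wy = (-b * diff[0] + a * diff[1]) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

# Toy data (made up for illustration): two elongated, parallel classes.
class1 = [(1, 2), (2, 3), (3, 3), (4, 5), (5, 5)]
class2 = [(1, 0), (2, 1), (3, 1), (4, 3), (5, 3)]
diff, s_w = scatter_matrices(class1, class2)
w = fisher_direction(diff, s_w)
```

Comparing <code>fisher_ratio(w, diff, s_w)</code> with the ratio for other directions, such as the coordinate axes, is a quick sanity check that <code>w</code> gives the largest separation on this toy data.<br />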
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Example: FDA<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we may wish to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image.<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> are called the ''training set.''<br />
<br />
Then <math>h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
'''Error Rate'''<br />
<br />
The true error rate of a classifier is<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we find the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points. This gives the empirical error rate:<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
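The counting rule above can be sketched in a few lines of Python (used here instead of MATLAB). The threshold classifier and the data points are hypothetical and serve only to exercise the formula.<br />

```python
def empirical_error_rate(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) != y_i."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

def threshold_classifier(x):
    """Toy classifier (an assumption for illustration): label 1 when x >= 0.5."""
    return 1 if x >= 0.5 else 0

xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]
rate = empirical_error_rate(threshold_classifier, xs, ys)
```

Here only the second point is misclassified, so one of the four points contributes to the sum.<br />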
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math>.<br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)Pr(Y=1)} {Pr(X=x)} </math><br />
<br />
where the denominator is <math>\, Pr(X=x) = Pr(X=x \mid Y=1)Pr(Y=1)+Pr(X=x \mid Y=0)Pr(Y=0) </math><br />
<br />
So our classifier function is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
This function, the Bayes classification rule, is considered the best classifier in terms of error rate: no other classifier has a lower true error rate. <br />
<br />
A problem is that in practice we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>.<br />
<br />
This function is therefore viewed as a theoretical bound: the best that can be achieved by various classification techniques.<br />
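When the class-conditional densities and the prior are known, the rule can be evaluated directly. The sketch below, in Python, assumes two univariate Gaussian classes with equal priors; these distributions and all function names are illustrative assumptions, since in practice the quantities must be estimated.<br />

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def posterior_class1(x, prior1, pdf1, pdf0):
    """r(x) = Pr(Y=1 | X=x) by Bayes' theorem."""
    numerator = pdf1(x) * prior1
    denominator = numerator + pdf0(x) * (1 - prior1)
    return numerator / denominator

def bayes_classifier(x, prior1, pdf1, pdf0):
    """h(x) = 1 if r(x) >= 1/2, else 0."""
    return 1 if posterior_class1(x, prior1, pdf1, pdf0) >= 0.5 else 0

# Assumed example: class 1 ~ N(2, 1), class 0 ~ N(0, 1), Pr(Y=1) = 0.5.
pdf1 = lambda x: normal_pdf(x, 2.0, 1.0)
pdf0 = lambda x: normal_pdf(x, 0.0, 1.0)
labels = [bayes_classifier(x, 0.5, pdf1, pdf0) for x in (-1.0, 0.5, 1.5, 3.0)]
```

With equal priors and equal variances, <math>r(x)</math> crosses 1/2 midway between the two means, so points to the right of x = 1 are assigned to class 1.<br />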
<br />
One such technique is Decision Boundary<br />
<br />
=== Decision Boundary ===</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3347stat341 / CM 3612009-07-21T22:59:54Z<p>Hclam: /* Application of PCA - Feature Abstraction */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges: a computer is deterministic, so the numbers it produces can only appear random (they are pseudo-random). A good generator produces a sequence in which each possible number appears to occur with equal probability, i.e. the sequence appears to follow a uniform distribution. Outside a computational setting, producing a uniform distribution is fairly easy (for example, rolling a fair die repeatedly to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, Park and Miller found in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
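The recurrence and the two worked examples above can be reproduced with a short Python sketch (Python is used here instead of MATLAB). The function names are illustrative; the scaling by ''m'' - 1 follows the description earlier in this section.<br />

```python
from itertools import islice

def lcg(a, b, m, seed):
    """Yield x_1, x_2, ... with x_{i+1} = (a * x_i + b) mod m."""
    x = seed
    while True:
        x = (a * x + b) % m
        yield x

def lcg_uniform(a, b, m, seed, n):
    """First n outputs scaled by 1 / (m - 1), as described above."""
    return [x / (m - 1) for x in islice(lcg(a, b, m, seed), n)]

good = list(islice(lcg(13, 0, 31, 1), 30))   # a = 13, b = 0, m = 31, x_0 = 1
bad = list(islice(lcg(3, 2, 4, 1), 5))       # a = 3, b = 2, m = 4, x_0 = 1
park_miller = lcg_uniform(7 ** 5, 0, 2 ** 31 - 1, 1, 5)
```

The list <code>good</code> starts 13, 14, 27 and visits all 30 nonzero residues modulo 31 before repeating, while <code>bad</code> is stuck at 1, matching the hand calculations.<br />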
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is passed through the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then the random variable X has cdf F, i.e. <math>P(X \leq x) = F(x)</math>.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
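:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as Unif[0, 1] whenever <math>U</math> is, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1, \dots, x_n</math> from <math>f(x)</math> by repeating the following for <math>i=1\dots n</math>:<br />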
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows exponential pattern as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
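The same two-step procedure can also be sketched in Python, with a numerical check in place of the histogram: the sample mean should be close to <math>1/\theta</math>. The function name and the seeded generator are assumptions made for this illustration.<br />

```python
import math
import random

def sample_exponential(theta, n, rng):
    """Draw n values via X = F^{-1}(U) = -log(1 - U) / theta."""
    # rng.random() lies in [0, 1), so 1 - U lies in (0, 1] and log is defined.
    return [-math.log(1.0 - rng.random()) / theta for _ in range(n)]

rng = random.Random(0)          # fixed seed so the run is reproducible
xs = sample_exponential(2.0, 100000, rng)
sample_mean = sum(xs) / len(xs)  # should be near 1 / theta = 0.5
```

Passing the generator in as an argument keeps the function deterministic under a fixed seed, which makes checks like this repeatable.<br />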
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math> in closed form. Further, for some distributions it is not even possible to find a closed-form expression for <math>F(x)</math> itself.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the upper less-than signs from class have been changed to less-than-or-equal-to signs here, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000); % preallocate the sample<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2)) % cumulative probability p(1)+p(2) = 0.5<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />
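For a distribution with many possible values, the chain of comparisons can be replaced by a search over the cumulative probabilities. The following Python sketch generalizes the example above; the function names are illustrative, and the use of a binary search is a design choice of this sketch rather than something covered in class.<br />

```python
import random
from bisect import bisect_left
from itertools import accumulate

def discrete_inverse(values, probs, u):
    """Return x_k for the first k whose cumulative probability is >= u."""
    cumulative = list(accumulate(probs))
    k = bisect_left(cumulative, u)
    # Guard against rounding error making the last cumulative sum fall below u.
    return values[min(k, len(values) - 1)]

def sample_discrete(values, probs, n, rng):
    """Draw n values by applying discrete_inverse to uniform draws."""
    return [discrete_inverse(values, probs, rng.random()) for _ in range(n)]

values, probs = [0, 1, 2], [0.3, 0.2, 0.5]
xs = sample_discrete(values, probs, 100000, random.Random(0))
frequencies = [xs.count(v) / len(xs) for v in values]
```

The observed <code>frequencies</code> should be close to 0.3, 0.2 and 0.5.<br />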
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant c such that <math>c \cdot g(x) \geq f(x)\ \forall x</math>, we can draw candidates from <math>g(x)</math> and accept each candidate x with probability <math> \frac {f(x)}{c \cdot g(x)} </math>. The accepted candidates follow the target distribution <math>f(x)</math>: a candidate is likely to be accepted where the ratio is close to 1 and likely to be rejected where the ratio is close to 0.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, sampling from <math> c \cdot g(x)</math> will yield a sample that follows the target distribution <math>f(x)</math>.<br />
<br />
At x=9, we will reject samples according to the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> after sampling from <math> c \cdot g(x)</math>.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
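 n = 1000; % number of samples to accept<br />
 x = zeros(1,n);<br />
 j = 0; % number of samples accepted so far<br />
 while j < n<br />
   y = rand;<br />
   u = rand;<br />
   if u <= y % accept with probability f(y)/(c*g(y)) = y<br />
     j = j + 1;<br />
     x(j) = y;<br />
   end<br />
 end<br />
 hist(x);<br />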
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
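The procedure can also be written once for a general target and proposal. The Python sketch below does this and then applies it to the Beta(2,1) example; the function signature is an assumption of this sketch. Two numerical checks replace the histogram: the sample mean should be near 2/3 (the mean of Beta(2,1)), and the fraction of accepted proposals should be near <math>1/c = 1/2</math>, as derived in the proof above.<br />

```python
import random

def accept_reject(f, g_pdf, g_sample, c, n, rng):
    """Return n accepted samples and the number of proposals drawn.

    f        -- target density
    g_pdf    -- proposal density
    g_sample -- function drawing one value from the proposal, given rng
    c        -- constant with c * g_pdf(x) >= f(x) for all x
    """
    samples = []
    proposals = 0
    while len(samples) < n:
        y = g_sample(rng)
        u = rng.random()
        proposals += 1
        if u <= f(y) / (c * g_pdf(y)):
            samples.append(y)
    return samples, proposals

rng = random.Random(0)
samples, proposals = accept_reject(
    f=lambda x: 2.0 * x,            # Beta(2,1) density on [0, 1]
    g_pdf=lambda x: 1.0,            # Unif[0, 1] density
    g_sample=lambda r: r.random(),
    c=2.0,
    n=20000,
    rng=rng,
)
sample_mean = sum(samples) / len(samples)
acceptance_rate = len(samples) / proposals
```

A smaller <math>c</math> means fewer rejected proposals, which is why <math>c</math> is chosen as the maximum of <math>f(x)/g(x)</math> rather than any larger bound.<br />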
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
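The discrete example can be sketched in Python as follows. The pmf is the one given above; the function name and the dictionary representation are assumptions made for illustration.<br />

```python
import random

PMF = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}

def sample_pmf_by_rejection(pmf, n, rng):
    """Acceptance/rejection with a discrete uniform proposal on the support."""
    support = sorted(pmf)
    g = 1.0 / len(support)            # proposal probability of each value
    c = max(pmf.values()) / g         # smallest c with c * g >= f everywhere
    samples = []
    while len(samples) < n:
        y = rng.choice(support)       # Y ~ discrete uniform on the support
        u = rng.random()
        if u <= pmf[y] / (c * g):     # accept with probability f(Y) / (c g(Y))
            samples.append(y)
    return samples

xs = sample_pmf_by_rejection(PMF, 40000, random.Random(0))
frequencies = {v: xs.count(v) / len(xs) for v in PMF}
```

The observed <code>frequencies</code> should be close to the target probabilities 0.15, 0.55, 0.20 and 0.10.<br />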
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math>, i.e. mean <math> 1/\lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
It appears that if we want to sample from the Gamma distribution, we can consider sampling from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and adding them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
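The product form in the last step of the procedure can be sketched in Python as below, for a positive integer <math>t</math>. The function name is illustrative. As a numerical check, a Gamma(t, λ) variable with rate <math>\lambda</math> has mean <math>t/\lambda</math> and variance <math>t/\lambda^2</math>.<br />

```python
import math
import random

def sample_gamma(t, lam, rng):
    """One draw from Gamma(t, lam), t a positive integer, as -log(prod u_i) / lam."""
    product = 1.0
    for _ in range(t):
        product *= 1.0 - rng.random()   # each factor lies in (0, 1], avoiding log(0)
    return -math.log(product) / lam

rng = random.Random(0)
t, lam, n = 5, 2.0, 50000
xs = [sample_gamma(t, lam, rng) for _ in range(n)]
sample_mean = sum(xs) / n                               # near t / lam = 2.5
sample_var = sum((x - sample_mean) ** 2 for x in xs) / n  # near t / lam^2 = 1.25
```

Taking one logarithm of the product, rather than summing <math>t</math> logarithms, is the only difference from adding up individual exponential draws.<br />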
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far: for example, the inverse of F(Z) is not easy to obtain. We may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If x and y are points on the Cartesian plane, r is the length of the radius from a point in the polar plane to the pole, and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi} \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> factors into the product of two density functions, Exponential (with rate 1/2, i.e. mean 2) and Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
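The three steps of the procedure can also be sketched in Python. Instead of histograms, this version checks the sample mean, the sample variance, and the sample covariance between x and y numerically. The function name and the seed are assumptions of this sketch.<br />

```python
import math
import random

def box_muller(rng):
    """Return one pair (x, y) of independent N(0, 1) draws."""
    u1 = 1.0 - rng.random()       # in (0, 1], so log(u1) is defined
    u2 = rng.random()
    d = -2.0 * math.log(u1)       # d ~ Exp(1/2)
    theta = 2.0 * math.pi * u2    # theta ~ Unif[0, 2*pi]
    r = math.sqrt(d)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)
n = 50000
pairs = [box_muller(rng) for _ in range(n)]
mean_x = sum(p[0] for p in pairs) / n
mean_y = sum(p[1] for p in pairs) / n
var_x = sum((p[0] - mean_x) ** 2 for p in pairs) / n
cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in pairs) / n
```

For a standard normal pair we expect <code>mean_x</code> and <code>cov_xy</code> near 0 and <code>var_x</code> near 1.<br />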
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (called u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
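To make the role of the matrix square root concrete, the following Python sketch computes the Cholesky factor of a 2 x 2 covariance matrix by hand and uses it to transform standard normal draws. It returns the lower-triangular factor <math>L</math> with <math>LL^T = \Sigma</math>, which is the transpose of the upper-triangular matrix returned by Matlab's chol. The function names are illustrative, and the sketch is limited to the 2 x 2 case.<br />

```python
import math
import random

def cholesky_2x2(sigma):
    """Lower-triangular L with L L^T = sigma (2x2, symmetric positive definite)."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

def sample_mvn_2d(mu, chol, rng):
    """X = mu + L Z with Z a vector of two independent N(0, 1) draws."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + chol[0][0] * z1,
            mu[1] + chol[1][0] * z1 + chol[1][1] * z2)

mu = (-2.0, 3.0)
sigma = [[1.0, 0.5], [0.5, 1.0]]
L = cholesky_2x2(sigma)
rng = random.Random(0)
n = 50000
xs = [sample_mvn_2d(mu, L, rng) for _ in range(n)]
m1 = sum(x[0] for x in xs) / n
m2 = sum(x[1] for x in xs) / n
cov12 = sum((x[0] - m1) * (x[1] - m2) for x in xs) / n
```

The sample means should be near -2 and 3 and the sample covariance near 0.5, whichever valid square root of <math>\Sigma</math> is used.<br />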
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[-\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2\right] </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i = \bar{x}</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\mu</math> is Gaussian with mean <math>\mu_0</math> and standard deviation <math>\tau</math> (and <math>\sigma</math> is known):<br />
:<math>f(\mu) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\mu|x) \propto \left[\prod_{i=1}^N\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}\right] \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\mu|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
where <math>\frac{1}{\tilde{\sigma}^2} = \frac{N}{\sigma^2}+\frac{1}{\tau^2}</math> and<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
The posterior mean <math>\tilde{\mu}</math> is a weighted average of the sample mean <math>\bar{x}</math> (the MLE) and the prior mean <math>\mu_0</math>, so in general it differs from the MLE.<br />
Note that as <math>N\to\infty</math> the weight on <math>\bar{x}</math> tends to 1, so <math>\tilde{\mu}</math> approaches <math>\bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
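<br />
The behaviour of the posterior mean can be seen numerically. The following is a minimal sketch with made-up values for <math>\sigma</math>, <math>\mu_0</math>, <math>\tau</math> and the data-generating mean (none of these numbers come from the lecture); it computes the MLE and the posterior mean for increasing N.<br />

```matlab
% Sketch: posterior mean of mu for a N(mu, sigma^2) model with known sigma
% and a N(mu0, tau^2) prior. All numeric values below are made up.
sigma = 2;  mu0 = 0;  tau = 1;   % known sd, prior mean, prior sd
muTrue = 3;                      % used only to simulate data
Ns = [1 10 100 10000];
muTilde = zeros(size(Ns));
xbar    = zeros(size(Ns));
for k = 1:length(Ns)
    N = Ns(k);
    x = muTrue + sigma*randn(N,1);
    xbar(k) = mean(x);                              % MLE of mu
    wData  = (N/sigma^2) / (N/sigma^2 + 1/tau^2);   % weight on xbar
    wPrior = (1/tau^2)   / (N/sigma^2 + 1/tau^2);   % weight on mu0
    muTilde(k) = wData*xbar(k) + wPrior*mu0;        % posterior mean
end
disp([Ns' xbar' muTilde']);   % the last two columns agree as N grows
```

For small N the posterior mean is pulled toward the prior mean; for large N the two estimates are nearly identical.<br />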
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to simulating a random process, whose future evolution is described by probability distributions (the counterpart of a deterministic process); these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)]</math>, which we estimate by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int_a^b w(x)f(x)\,dx = E_f[w(x)] = I</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
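<br />
The process and the two statistics above can be sketched in Matlab as follows. The integrand <math>h(x)=\sin(x)</math> on <math>[0,\pi]</math> (exact value 2) and the sample size are illustrative choices, not part of the lecture.<br />

```matlab
% Sketch: basic Monte Carlo integration of h over [a,b] with a standard
% error and an approximate 95% confidence interval.
% The integrand h(x) = sin(x) on [0, pi] is an illustrative choice (exact value 2).
h = @(x) sin(x);
a = 0;  b = pi;
N = 100000;

x = a + (b-a)*rand(N,1);   % x_i ~ Unif(a,b)
w = h(x)*(b-a);            % w(x_i) = h(x_i)(b-a)
Ihat = mean(w);            % estimate of I
SE = std(w)/sqrt(N);       % std uses the N-1 denominator
z = 1.96;                  % Z_{alpha/2} for alpha = 0.05
CI = [Ihat - z*SE, Ihat + z*SE];
disp(Ihat); disp(SE); disp(CI);
```

The standard error shrinks like <math>1/\sqrt{N}</math>, so gaining one more decimal digit of accuracy costs roughly 100 times more samples.<br />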
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x \geq 0</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
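u=rand(100,1);<br />
x=-log(u); % inverse transform: x has density exp(-x)<br />
h= x.^ .5; % sqrt(x_i)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value of the integral computed numerically by Matlab; the exact value is sqrt(pi)/2, about 0.8862<br />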
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# Analytically, <math> \displaystyle I = \left[x^4/4\right]_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1); % 1000 draws from Unif(0,1)<br />
mean(x.^3) % element-wise cube<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite range<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dx=-\frac{dy}{y^2} </math>, with <math>x = \frac{1}{y}+4</math>; <math>x=5</math> maps to <math>y=1</math> and <math>x\to\infty</math> maps to <math>y\to 0</math><br />
# <math> I = \displaystyle\int^0_1 3e^{-(\frac{1}{y}+4)}\left(-\frac{1}{y^2}\right)\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2}</math><br />
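<br />
Example 3 can be sketched in Matlab as follows. After the substitution <math>y = 1/(x-4)</math> the integrand on <math>(0,1)</math> is <math>3e^{-(1/y+4)}/y^2</math>, which is positive, and the exact value <math>3e^{-5}</math> is available for comparison. The sample size is an illustrative choice.<br />

```matlab
% Sketch: estimate I = int_5^inf 3*exp(-x) dx using y = 1/(x-4),
% which maps (5, inf) to (0, 1). Exact value: 3*exp(-5).
N = 100000;
y = rand(N,1);                      % y_i ~ Unif(0,1)
w = 3*exp(-(1./y + 4)) ./ y.^2;     % integrand after the substitution
Ihat = mean(w);
Iexact = 3*exp(-5);
disp([Ihat Iexact]);
```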
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of <math>\displaystyle p_1</math> is proportional to the Binomial likelihood <math>\displaystyle p_1^x(1-p_1)^{n-x}</math>, which is the kernel of a Beta density (and similarly for <math>\displaystyle p_2</math>). Therefore<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The 95% interval is approximately <math>(0.06606, 0.32204)</math>.<br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We assume a flat prior on each <math>\displaystyle p_i</math>, then:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta</math> as the smallest dose <math>\displaystyle x_j</math> whose sampled <math>\displaystyle p_j</math> is at least 0.5.<br />
# Repeat the process until N samples are kept, giving N values <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die, <math>Y_i</math>, is observed, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
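<br />
The Monte Carlo process for the dose-response problem can be sketched as below. The dose levels and death counts are hypothetical (the full data set is not given in these notes), and two choices are assumptions of this sketch rather than part of the lecture: Beta draws are generated from uniform order statistics so that no toolbox is needed (with the Statistics Toolbox, betarnd(y+1, n-y+1) could be used instead), and a kept sample in which no <math>p_i</math> reaches 0.5 is discarded.<br />

```matlab
% Sketch of the dose-response Monte Carlo procedure.
% The doses and death counts below are hypothetical, chosen only for illustration.
% Fact used: the k-th smallest of m Unif(0,1) draws is Beta(k, m-k+1), so the
% (y+1)-th smallest of n+1 uniforms is Beta(y+1, n-y+1).
n = 10;                                  % rats per dose
x = 1:10;                                % hypothetical dose levels
y = [0 1 2 3 4 6 7 8 9 10];              % hypothetical deaths per dose
N = 200;                                 % number of kept samples wanted
delta = zeros(N,1);
kept = 0;
while kept < N
    U = sort(rand(n+1, 10));             % each column sorted ascending
    p = U(sub2ind(size(U), y+1, 1:10));  % p_i ~ Beta(y_i+1, n-y_i+1)
    if all(diff(p) >= 0)                 % keep only monotone draws
        j = find(p >= 0.5, 1);           % first dose with p_j >= 0.5
        if ~isempty(j)                   % assumption: skip draws with no such dose
            kept = kept + 1;
            delta(kept) = x(j);
        end
    end
end
deltaBar = mean(delta);
disp(deltaBar);
```

Most draws are rejected by the ordering constraint, so the loop typically runs many more than N times.<br />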
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As with the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target distribution. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from (with g(x) > 0 wherever h(x)f(x) is nonzero), then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> with <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the Law of Large Numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> at which the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), is large relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, but weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that the points for which <math>\displaystyle \frac{f(x)}{g(x)}</math> is high count more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of <math> h(x_i) </math> instead of a plain average.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math> We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x)</math> does not, then <math>\displaystyle E[(w(x))^2]</math> can be infinite. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>: then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can grow without bound in the tails. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
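<br />
The effect of the tails can be seen in a small experiment. In the sketch below the target f is N(0,1) and <math>h(x)=1</math>, so <math>I=1</math>; the two proposals, <math>N(0,0.5^2)</math> (thinner tail) and <math>N(0,2^2)</math> (thicker tail), are illustrative choices and not from the lecture.<br />

```matlab
% Sketch: effect of the proposal's tail on importance-sampling weights.
% Target f is N(0,1) and h(x) = 1, so I = 1. The two proposals are
% illustrative choices: N(0, 0.5^2) (thinner tail) and N(0, 2^2) (thicker tail).
f = @(x) exp(-x.^2/2)/sqrt(2*pi);
g = @(x,s) exp(-x.^2/(2*s^2))/(sqrt(2*pi)*s);   % N(0,s^2) density
N = 10000;

sThin = 0.5;
xThin = sThin*randn(N,1);
wThin = f(xThin)./g(xThin,sThin);       % weights grow like exp(1.5*x^2)

sThick = 2;
xThick = sThick*randn(N,1);
wThick = f(xThick)./g(xThick,sThick);   % weights bounded by 2

disp([mean(wThin)  var(wThin)]);        % unstable from run to run
disp([mean(wThick) var(wThick)]);       % mean near 1, small variance
```

With the thin-tailed proposal the second moment <math>\int f^2/g\,dx</math> is infinite, so a few rare samples in the tails dominate the estimate; with the thick-tailed proposal the weights are bounded and the estimate is stable.<br />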
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x)\,dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
:'''Aside: Jensen's Inequality'''<br />
:If <math>\displaystyle \phi </math> is a convex function (for example, twice differentiable with <math>\displaystyle \phi''(x) \geq 0 </math>) then <math>\displaystyle \phi(\alpha x_1 + (1-\alpha)x_2) \leq \alpha \phi(x_1) + (1-\alpha) \phi(x_2)</math> for <math>\displaystyle 0 \leq \alpha \leq 1</math>. We write <math>\displaystyle \phi</math> rather than <math>\displaystyle g</math> here to avoid confusion with the density <math>\displaystyle g</math>.<br /><br />
:Essentially the definition of convexity says that the line segment between two points on a curve lies above the curve. This generalizes to any finite number of points:<br />
::<math>\displaystyle \phi(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + ... + \alpha_n \phi(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle \phi(E[x]) \leq E[\phi(x)] </math> as <math>\displaystyle \phi(E[x]) = \phi(p_1 x_1 + ... p_n x_n) \leq p_1 \phi(x_1) + ... + p_n \phi(x_n) = E[\phi(x)] </math><br />
Therefore, taking <math>\displaystyle \phi(t) = t^2</math> and applying this to <math>\displaystyle \left| w(x) \right|</math>,<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
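Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle w(x)</math>, then <math>\displaystyle \left| w(x) \right| = \int \left| h(s) \right| f(s) ds</math> is constant, so <math>\displaystyle E_{g^*}[(w(x))^2] = \left(\int \left| h(x) \right| f(x) dx \right)^2</math> and the lower bound is attained.<br /><br />
<br />
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />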
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
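<br />
The theorem can be illustrated in a case where <math>g^*</math> happens to be available. In the sketch below, <math>f(x)=e^{-x}</math> on <math>x>0</math> and <math>h(x)=x</math> are illustrative choices (not from the lecture), so <math>I=E[X]=1</math> and <math>g^*(x)=xe^{-x}</math>, which is the Gamma(2,1) density and can be sampled as the sum of two Exp(1) draws.<br />

```matlab
% Sketch: the minimum-variance proposal g* in a case where it is available.
% Illustrative setup: f(x) = exp(-x) on x > 0 and h(x) = x, so I = E[X] = 1.
% Then g*(x) = x*exp(-x), the Gamma(2,1) density (its normalizing constant is I itself).
N = 1000;
x = -log(rand(N,1)) - log(rand(N,1));   % sum of two Exp(1) draws ~ Gamma(2,1)
hf    = x.*exp(-x);                     % h(x) f(x)
gstar = x.*exp(-x);                     % g*(x)
w = hf./gstar;                          % every weight equals 1
Ihat = mean(w);
disp([Ihat var(w)]);                    % variance of the weights is 0
```

Every weight is identical, so the estimator has zero variance. Writing down <math>g^*</math> required knowing <math>\int |h|f\,dx = 1</math> in advance, which is why this choice is rarely usable in practice.<br />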
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \frac{1}{N}\displaystyle\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since <math>f</math> is a density, <math>\int f(x)\,dx = 1</math>, so we can also write<br />
:: <math>I=\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)}{\frac{1}{N}\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{k=1}^{N} b(x_k)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{k=1}^{N} b(x_k)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant, because the constant cancels in the ratio. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
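<br />
A minimal sketch of this normalized-weight form is given below. The setup is an assumption chosen for illustration: the target is known only as the unnormalized kernel <math>e^{-x^2/2}</math> (a N(0,1) density without its constant), <math>h(x)=x^2</math> so that <math>I=1</math>, and the proposal is <math>N(0,2^2)</math>.<br />

```matlab
% Sketch: self-normalized importance sampling when f is known only up to a constant.
% Illustrative setup: unnormalized target fu(x) = exp(-x^2/2) (a N(0,1) kernel),
% h(x) = x^2 (so I = 1), and proposal g = N(0, 2^2).
fu = @(x) exp(-x.^2/2);                       % unnormalized f
g  = @(x) exp(-x.^2/8)/(2*sqrt(2*pi));        % N(0,4) density
h  = @(x) x.^2;
N = 100000;
x = 2*randn(N,1);          % x_i ~ g
b = fu(x)./g(x);           % unnormalized weights b(x_i)
bNorm = b/sum(b);          % b'(x_i), sums to 1
Ihat = sum(bNorm.*h(x));   % the unknown constant of f cancels in the ratio
disp(Ihat);
```

Multiplying fu by any positive constant leaves bNorm, and therefore the estimate, unchanged.<br />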
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the normal distribution and to count the number of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is a low probability event (a tail event), and different runs may give significantly different values. For example, with 1000 samples per run, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (in this case we repeat the experiment 100 times, generating 100 samples each time):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x)</math> be the <math>N(4,1)</math> density, so that<br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math> which is very close to 0, and is much less than the variability observed when using Monte Carlo<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take values.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set from which t takes values; it could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d. random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math>, then<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math>\displaystyle X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below, with states numbered from left to right so that a step to the right goes from state i to state i+1:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\vdots&\ddots&\ddots&\ddots&\vdots\\<br />
0&0&\dots&q&0&p\\<br />
0&0&\dots&0&0&1<br />
\end{matrix}\right)</math><br />
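<br />
A transition matrix of this kind can be built as in the following sketch. The number of states (5) and <math>p = 0.6</math> are illustrative values, and the sketch assumes that states are numbered from left to right, so a step to the right goes from state i to state i+1.<br />

```matlab
% Sketch: transition matrix of the random walk with absorbing end states.
% The number of states (5) and p = 0.6 are illustrative values.
% Convention assumed here: states are numbered left to right, so a step to
% the right goes from i to i+1.
nStates = 5;
p = 0.6;  q = 1 - p;
P = zeros(nStates);
P(1,1) = 1;                  % first state: stop
P(nStates,nStates) = 1;      % last state: stop
for i = 2:nStates-1
    P(i,i+1) = p;            % step right
    P(i,i-1) = q;            % step left
end
disp(P);
disp(sum(P,2)');             % each row sums to 1
```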
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general concept for the application of this lies within having a set of generated <math>x_i</math> which approach a distribution <math>f(x)</math> so that a variation of importance estimation can be used to estimate an integral in the form<br /><br />
<math> I = \displaystyle\int^\ h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row sum to 1: row <math>i</math> is itself a pmf over the next state, since a transition from <math>i</math> must lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that <math>\frac{}{}\mu_{n-1}P=\mu_n</math>, and hence that<br />
<math>\frac{}{}\mu_n=\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=x_i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=x_i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=x_i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
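<br />
As a small numerical illustration of the theorem, the sketch below pushes an initial distribution through a transition matrix one step at a time and compares the result with <math>\mu_0P^n</math>. The two-state matrix and the initial vector are made-up values chosen only for illustration; they do not come from the lecture.<br />

```matlab
% Hypothetical two-state chain (values chosen only for illustration)
P   = [0.9 0.1; 0.4 0.6];   % transition matrix, each row sums to 1
mu0 = [1 0];                % start in state 1 with probability 1
n   = 5;

% Step-by-step update: mu_k = mu_{k-1} * P
mu = mu0;
for k = 1:n
    mu = mu * P;
end

% Closed form from the theorem: mu_n = mu_0 * P^n
mu_n = mu0 * P^n;

disp(mu);     % marginal distribution after n steps, computed iteratively
disp(mu_n);   % the same vector, computed with the matrix power
```

Both vectors agree up to rounding, and each still sums to 1, as a pmf must.<br />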
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N\times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br />
*'''n-step transition matrix''' is also <math>N\times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br />
*'''Theorem:''' <math>P(n)=P^n</math><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all its states communicate with each other (i.e. it has only one equivalence class).<br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that from states 1 and 2 the only nonzero transition probabilities are <math>\displaystyle P_{11}</math>, <math>\displaystyle P_{12}</math>, <math>\displaystyle P_{21}</math> and <math>\displaystyle P_{22}</math>, so the chain never leaves this class<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state, but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state, since it is a closed class with only one element<br />
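<br />
The decomposition above can also be found mechanically. The following is a minimal sketch, not part of the lecture: it marks which states reach which (state j is reachable from i within at most N-1 steps), declares two states to communicate when each reaches the other, and then tests each class for closedness. The variable names are our own.<br />

```matlab
P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];
N = size(P,1);

A = double(P > 0);                 % one-step adjacency: A(i,j) = 1 if P(i,j) > 0
R = (eye(N) + A)^(N-1) > 0;        % R(i,j) true if j is reachable from i (or j == i)
C = double(R & R');                % C(i,j) = 1 if i and j communicate

classes = unique(C, 'rows');       % each distinct row is the indicator of one class

for k = 1:size(classes,1)
    in = logical(classes(k,:));
    isClosed = all(all(P(in, ~in) == 0));   % closed: no probability leaves the class
    fprintf('class {%s}  closed: %d\n', num2str(find(in)), isClosed);
end
```

For this matrix the classes come out as {1,2}, {3} and {4}, with {3} the only class that is not closed, in agreement with the decomposition above.<br />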
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i</math> for some <math>\displaystyle n\geq 1 | x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that all states of a finite irreducible chain are recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. If we take n steps to the right, the only way to return to <math>\displaystyle 0</math> is to take n steps to the left as well, so a return is possible only after an even number of steps:<br />
<math>\displaystyle Pr(X_{2n}=0|X_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know whether this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
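<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
Applying it to <math>\displaystyle {2n \choose n}=\frac{(2n)!}{(n!)^2}</math> gives <math>\displaystyle {2n \choose n}\sim\frac{4^{n}}{\sqrt{n\pi}}</math>, so the probability can be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle P_{00}(n)=0</math> for odd n, the sum becomes:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math>, i.e. <math>p\neq\frac{1}{2}</math>, the terms decay geometrically and the state is transient; otherwise <math>\displaystyle 4pq = 1</math>, the terms behave like <math>\frac{1}{\sqrt{n\pi}}</math>, the sum diverges, and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n) = \begin{cases}<br />
\infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />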
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
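<br />
The two cases can be seen numerically. The sketch below accumulates partial sums of <math>P_{00}(2n)</math> for <math>p=0.5</math> and for <math>p=0.6</math> (an arbitrary asymmetric value picked for illustration) and compares the second with the closed form <math>\frac{1}{\sqrt{1-4pq}}</math>. Terms are built from the ratio of consecutive terms so that the binomial coefficient never overflows.<br />

```matlab
% Partial sums of P00(2n) = nchoosek(2n,n) * (p*q)^n for the random walk.
% Consecutive terms satisfy t_n = t_{n-1} * 2*(2n-1)/n * p*q.
Nmax = 100000;
ps   = [0.5 0.6];           % p = 0.6 is an arbitrary asymmetric example
S    = zeros(size(ps));
for k = 1:numel(ps)
    p = ps(k);  q = 1 - p;
    t = 1;                  % n = 0 term: P00(0) = 1
    s = t;
    for n = 1:Nmax
        t = t * 2*(2*n-1)/n * p*q;
        s = s + t;
    end
    S(k) = s;
end
closedForm = 1/sqrt(1 - 4*0.6*0.4);   % generalized binomial series at x = p*q

fprintf('p = 0.5: partial sum = %.1f (keeps growing with Nmax)\n', S(1));
fprintf('p = 0.6: partial sum = %.6f, closed form = %.6f\n', S(2), closedForm);
```

For <math>p=0.6</math> the partial sum settles at the closed-form value, while for <math>p=0.5</math> it grows roughly like <math>2\sqrt{N/\pi}</math> and never settles.<br />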
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time needed to go from state i to state j. It is also possible that state j is not reachable from state i, in which case the time is infinite.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n\geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists (given } x_{0}=i\text{)} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j|x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite-state Markov chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
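A state <math>\displaystyle i</math> has period <math>\displaystyle d</math> if <math>\displaystyle d</math> is the greatest common divisor of all <math>\displaystyle n\geq 1</math> with <math>\displaystyle P_{ii}(n)>0</math>, i.e. starting from <math>\displaystyle i</math>, a return to <math>\displaystyle i</math> is possible only after a multiple of <math>\displaystyle d</math> steps. A Markov chain is called <math>\emph{periodic}</math> if its states have period <math>\displaystyle d>1</math>.<br />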
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It is evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: starting from state 1, a return to state 1 is possible only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
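<br />
The period of a state can be read off from the powers of <math>P</math>. The sketch below is our own illustration, with the cutoff of 20 steps an arbitrary assumption: it takes the greatest common divisor of all step counts n at which a return to state 1 has positive probability, for the two example chains of this section.<br />

```matlab
% The two example chains of this section: the 3-cycle and the 4-state chain
chains = {[0 1 0; 0 0 1; 1 0 0], ...
          [0 .5 0 .5; .5 0 .5 0; 0 .5 0 .5; .5 0 .5 0]};
maxSteps = 20;                    % assumption: enough steps to reveal the pattern
periods  = zeros(1, numel(chains));

for c = 1:numel(chains)
    P  = chains{c};
    Pn = eye(size(P,1));
    d  = 0;                       % gcd(0, n) = n, so 0 is a neutral starting value
    for n = 1:maxSteps
        Pn = Pn * P;              % Pn now holds P^n
        if Pn(1,1) > 0            % a return to state 1 is possible in n steps
            d = gcd(d, n);
        end
    end
    periods(c) = d;
end
disp(periods);
```

The result is 3 for the first chain and 2 for the second, matching the periods found by inspection.<br />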
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br />
We need to show <math>\pi = \pi P</math>. Using detailed balance and the fact that each row of <math>P</math> sums to 1,<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
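<br />
A short check of this warning, as a sketch: the uniform vector is left unchanged by <math>P</math>, yet the powers of <math>P</math> cycle with period 3, so a chain started in state 1 never settles down. The variable names are ours.<br />

```matlab
P     = [0 1 0; 0 0 1; 1 0 0];
piVec = [1/3 1/3 1/3];

disp(piVec * P);        % equals piVec, so piVec is stationary

P3 = P^3;               % the identity matrix
P4 = P^4;               % equal to P again: the powers cycle and do not converge

mu0 = [1 0 0];          % start in state 1
disp(mu0 * P^10);       % all mass sits on a single state ...
disp(mu0 * P^11);       % ... and on a different state one step later
```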
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \;\; 0 \;\; 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n)= E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
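<br />
To see the last statement at work, the sketch below simulates a small chain and compares the time average of <math>g</math> along the path with the average of <math>g</math> under <math>\pi</math>. The two-state matrix, its stationary vector and the choice <math>g(j)=j</math> are illustrative assumptions, not taken from the lecture.<br />

```matlab
P     = [0.9 0.1; 0.4 0.6];     % hypothetical ergodic two-state chain
piVec = [0.8 0.2];              % its stationary distribution (piVec*P equals piVec)
g     = @(j) j;                 % a bounded function on the states {1, 2}

N = 50000;
x = 1;                          % arbitrary starting state
total = 0;
for n = 1:N
    x = find(rand < cumsum(P(x,:)), 1);   % draw the next state from row x of P
    total = total + g(x);
end

timeAvg  = total / N;                 % average of g along the simulated path
spaceAvg = sum(g(1:2) .* piVec);      % average of g under the stationary distribution
fprintf('time average %.4f, stationary average %.4f\n', timeAvg, spaceAvg);
```

The two averages agree up to simulation noise, whatever the starting state.<br />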
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
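<br />
The same answer can be obtained numerically. This is a minimal sketch under our own naming: it stacks the equations <math>\pi(P-I)=0</math> with the normalization <math>\sum_i\pi_i=1</math> into one linear system, solves it, and compares the result with a high power of <math>P</math>, whose rows approach the limiting distribution.<br />

```matlab
P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];
n = size(P,1);

% pi*(P - I) = 0 together with sum(pi) = 1, written as A*x = b with x = pi'
A = [P' - eye(n); ones(1,n)];
b = [zeros(n,1); 1];
piVec = (A \ b)';          % least-squares solve of the consistent, overdetermined system

Plim = P^50;               % every row of a high power approaches the limiting distribution

disp(piVec);               % numerical stationary distribution
disp([4 4 3]/11);          % the hand-computed answer
disp(Plim);
```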
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(X_i)\longrightarrow E_f(h(X))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>: this is the starting point of the chain. Choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br />2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br />4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. The starting point <math>X_0</math> of the chain is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This special case is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953), which Metropolis-Hastings generalizes.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the density of y under a normal with mean x equals the density of x under a normal with mean y.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
 b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
 N=10000; % number of samples in the chain<br />
 X=zeros(1,N); % preallocate the chain<br />
 X(1)=randn; % random starting point<br />
 for i=2:N<br />
    Y=b*randn+X(i-1); % propose Y ~ N(X(i-1),b^2); we want to decide whether we accept this Y<br />
    r=min( (1+X(i-1)^2)/(1+Y^2),1); % acceptance probability for the Cauchy target<br />
    u=rand;<br />
    if u<r<br />
        X(i)=Y; % accept Y<br />
    else<br />
        X(i)=X(i-1); % reject Y, remaining in the current state<br />
    end;<br />
 end;<br />
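<br />
One way to sanity-check the output of such a sampler, sketched here as a suggestion rather than as part of the lecture, is to compare empirical quartiles with the known quartiles of the standard Cauchy distribution, which are -1, 0 and 1. The chain length below is an arbitrary choice, and the quartiles are taken directly from the sorted sample to avoid toolbox functions.<br />

```matlab
b = 2;  N = 50000;                 % chain length chosen arbitrarily for this check
X = zeros(1,N);  X(1) = randn;
for i = 2:N
    Y = X(i-1) + b*randn;                          % random-walk proposal
    if rand < min((1 + X(i-1)^2)/(1 + Y^2), 1)     % Cauchy acceptance ratio
        X(i) = Y;
    else
        X(i) = X(i-1);
    end
end

Xs = sort(X);
quartiles = Xs(round([0.25 0.5 0.75]*N));   % empirical 25%, 50% and 75% points
disp(quartiles);                            % should be roughly [-1 0 1]
```

The sample mean is deliberately not used as a check, since the Cauchy distribution has no mean.<br />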
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then proposals often land far out in the tails, where the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution within the run.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b helps avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
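<br />
A quick diagnostic for the choice of b is the fraction of proposals that get accepted. The sketch below is our own addition, with an arbitrary chain length: it runs the Cauchy example for the three values of b used above and records that fraction.<br />

```matlab
bs = [0.001 2 1000];              % the three proposal scales discussed above
N  = 20000;                       % chain length (arbitrary)
accRate = zeros(size(bs));

for k = 1:numel(bs)
    b = bs(k);
    x = randn;
    nAcc = 0;
    for i = 2:N
        y = x + b*randn;
        if rand < min((1 + x^2)/(1 + y^2), 1)
            x = y;
            nAcc = nAcc + 1;      % count accepted proposals
        end
    end
    accRate(k) = nAcc/(N-1);
end
disp([bs; accRate]);              % first row: b, second row: acceptance rate
```

With a tiny b almost every proposal is accepted but the chain barely moves; with a huge b almost every proposal is rejected. A moderate b gives an acceptance rate between the two extremes.<br />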
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br /> We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
Starting from the right-hand side of the stationarity condition and applying detailed balance:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math><br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br /><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
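<br />
The argument can be checked numerically on a small discrete state space. In the sketch below the target pmf f and the asymmetric proposal matrix Q are hypothetical values chosen for illustration. The code builds the Metropolis-Hastings transition matrix from them and measures how far <math>f_iP_{ij}</math> is from <math>f_jP_{ji}</math>, and how far <math>fP</math> is from <math>f</math>.<br />

```matlab
f = [0.1 0.2 0.3 0.4];                 % hypothetical target pmf
Q = [0.10 0.50 0.20 0.20;              % hypothetical proposal matrix, rows sum to 1,
     0.30 0.10 0.30 0.30;              % deliberately not symmetric
     0.25 0.25 0.25 0.25;
     0.40 0.30 0.20 0.10];
K = numel(f);

P = zeros(K);
for i = 1:K
    for j = 1:K
        if i ~= j
            r = min(f(j)*Q(j,i) / (f(i)*Q(i,j)), 1);   % acceptance probability r(i,j)
            P(i,j) = Q(i,j) * r;                       % propose j, then accept it
        end
    end
    P(i,i) = 1 - sum(P(i,:));                          % rejected proposals stay at i
end

F = diag(f) * P;                        % F(i,j) = f_i * P_ij
balanceGap    = max(max(abs(F - F')));  % zero if detailed balance holds
stationaryGap = max(abs(f*P - f));      % zero if f is stationary
fprintf('detailed balance gap %.2e, stationarity gap %.2e\n', balanceGap, stationaryGap);
```

Both gaps are zero up to rounding, even though Q itself is not symmetric.<br />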
<br />
<br />
Next lecture we will see that the Metropolis-Hastings chain is also irreducible and aperiodic, so that it converges to this stationary distribution.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999); his algorithm is also the basis of the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although it is less widely applicable.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This holds trivially for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>, i.e. <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br />2. This means <math>q(y\mid{x}) = g(y-x)</math>. (Here <math>\displaystyle g </math> is assumed to be a function of the distance between the current state and the state the chain is going to travel to, i.e. it is of the form <math>\displaystyle g(|y-x|) </math>. Hence in this version <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br />3. <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In Matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. (The method is usually used when <math>\displaystyle f </math> is known only up to a proportionality constant, i.e. the form of the distribution is known, but the distribution is not exactly known.)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though the proposal <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it is not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r(x_i,y) \\<br />
x_i, & \text{with probability } 1-r(x_i,y) \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
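<br />
A minimal sketch of the independence sampler follows, under assumptions of our own: the target is a standard normal known only up to its normalizing constant, and the fixed proposal is a standard Cauchy, which has heavier tails than the target and is drawn by inverting its CDF. The code also counts how often the chain repeats its previous state, which is exactly the dependence discussed above.<br />

```matlab
fu = @(x) exp(-x.^2/2);          % target known only up to a constant (standard normal shape)
q  = @(x) 1./(pi*(1 + x.^2));    % fixed proposal density: standard Cauchy (our assumption)

N = 20000;
X = zeros(1,N);                  % chain starts at X(1) = 0
nStay = 0;
for i = 2:N
    Y = tan(pi*(rand - 0.5));                          % Cauchy draw; does not use X(i-1)
    r = min(fu(Y)*q(X(i-1)) / (fu(X(i-1))*q(Y)), 1);   % but r does depend on X(i-1)
    if rand < r
        X(i) = Y;
    else
        X(i) = X(i-1);           % rejected: the chain repeats its previous state
        nStay = nStay + 1;
    end
end
fprintf('mean %.3f, variance %.3f, repeated states %d\n', mean(X), var(X), nStay);
```

The sample mean and variance should be close to 0 and 1. The repeated states show that consecutive values are not independent, even though the proposals are.<br />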
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. This is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant <math>\displaystyle T>0</math> (since the exponential function is monotone)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, which is close to 1 when T is large, so we still accept <math>\displaystyle y </math> with high probability and the chain can move uphill and escape local minima<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r \rightarrow 0 </math> and we therefore (almost always) reject <math>\displaystyle y </math><br />
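<br />
To make the effect of T concrete, the following MATLAB sketch evaluates the acceptance probability of a single uphill move at several temperatures. The gap <math>h(y) - h(x) = 1</math> and the list of temperatures are illustrative assumptions, not values from the lecture.<br />

```matlab
% Acceptance probability of an uphill move with h(y) - h(x) = 1 (assumed gap)
% at several temperatures T (assumed values).
gap = 1;                          % h(y) - h(x) > 0, i.e. y is worse than x
T = [100 10 1 0.1 0.01];
r = min(1, exp(-gap ./ T));       % r = min(1, e^{(h(x)-h(y))/T})
disp([T' r']);                    % r is near 1 for large T, near 0 for small T
```

For large T almost every uphill proposal is accepted, while for small T the sampler behaves almost like a pure descent method.<br />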
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle \frac{f(x)}{T}</math> for <math>\displaystyle T = 100</math> and <math>\displaystyle T = 0.1</math> and observe that the distribution spreads out for large T and contracts for small T; that is, T plays a role similar to that of a variance.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, when T has become small, the distribution that we are sampling from is sharply peaked, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
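<br />
As a quick check of where the local minimum of Example 1 lies, the sketch below solves <math>f'(x) = -6x^2 - 2x + 40 = 0</math> and uses the sign of <math>f''(x) = -12x - 2</math> to pick out the minimum. The variable names are illustrative. Note that the cubic is unbounded below as <math>x \rightarrow \infty</math>, so this minimum is only local.<br />

```matlab
% Critical points of f(x) = -2x^3 - x^2 + 40x + 3 from f'(x) = -6x^2 - 2x + 40.
crit = roots([-6 -2 40]);         % solutions of f'(x) = 0
curv = -12*crit - 2;              % f''(x) at each critical point
xmin = crit(curv > 0);            % positive curvature -> local minimum
```

The local minimum sits at roughly x = -2.75, which agrees with the value near -3 read off the plot.<br />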
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
obj = @(x) (x - 2).^2; %objective function to minimize<br />
<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2; %standard deviation of the proposal distribution<br />
X(1) = 0; %starting point<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); %propose Y ~ N(X(i-1), b^2)<br />
r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) ); %f(Y)/f(X), written as one exponential to avoid 0/0 for small T<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. X converges to the minimizer of the function.<br />
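<br />
The whole temperature schedule can also be run in one script. The following is a minimal sketch, assuming 100 Metropolis steps per temperature (the lecture changes T by hand between runs); the names <code>Ts</code>, <code>nPer</code> and <code>xhat</code> are illustrative.<br />

```matlab
% Simulated annealing for h(x) = (x - 2)^2 with the temperature schedule
% quoted above; 100 Metropolis steps per temperature (assumed).
h = @(x) (x - 2).^2;
Ts = [10 9 5 2 0.9 0.1 0.01 0.005 0.001];
b = 2;                            % proposal standard deviation
nPer = 100;                       % steps per temperature (assumed)
X = zeros(1, nPer*numel(Ts) + 1); % X(1) = 0 is the starting point
k = 1;
for T = Ts
    for s = 1:nPer
        Y = X(k) + b*randn;                 % symmetric normal proposal
        r = min(1, exp((h(X(k)) - h(Y))/T));
        if rand < r
            X(k+1) = Y;                     % accept
        else
            X(k+1) = X(k);                  % reject
        end
        k = k + 1;
    end
end
xhat = X(end);                    % should lie close to the minimizer x = 2
```

Plotting <code>X</code> should give a trace that wanders widely at first and then settles near 2 as T decreases.<br />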
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been studied by researchers from diverse academic backgrounds, including mathematics, computer science, chemistry, physics and psychology. Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience, to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Like Metropolis-Hastings, Gibbs Sampling satisfies the detailed balance equation<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the one given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method mixes slowly, and can be impractical, when there are strong correlations between the random variables.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and, once the chain has converged, the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in" time. For this reason the distribution is sampled better by using only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; %mean vector of the target distribution<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2); %conditional standard deviation<br />
X(1,:) = [0 0]; %starting point<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); %conditional mean of x1 given the previous x2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); %conditional mean of x2 given the NEW x1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the points after the first 999 -> <br />
%this demonstrates that convergence occurs after a while <br />
%(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
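<br />
The sampler can be checked numerically by comparing the sample mean and correlation with the target values. The sketch below is a hedged variant of the example above: the stronger correlation <code>rho = 0.8</code>, the run length and the burn-in of 1000 draws are assumed values chosen for illustration.<br />

```matlab
% Gibbs sampler for a bivariate normal with a stronger correlation than in
% the example above (rho = 0.8 and the burn-in length are assumed values).
mu = [1; 2];
rho = 0.8;
s = sqrt(1 - rho^2);              % conditional standard deviation
n = 20000;
burn = 1000;                      % draws discarded as burn-in (assumed)
X = zeros(n, 2);                  % X(1,:) = [0 0] is the starting point
for i = 2:n
    % draw x1 given the previous x2, then x2 given the NEW x1
    X(i,1) = mu(1) + rho*(X(i-1,2) - mu(2)) + s*randn;
    X(i,2) = mu(2) + rho*(X(i,1)   - mu(1)) + s*randn;
end
S = X(burn+1:end, :);
m = mean(S);                      % should be close to [1 2]
C = corrcoef(S);                  % off-diagonal should be close to rho
```

With a correlation this strong, successive draws are noticeably dependent, so more iterations are needed for the same accuracy than with <code>rho = 0.01</code>.<br />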
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
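<br />
The procedure can be sketched in MATLAB as follows. This is only an illustration under stated assumptions: the target is taken to be an unnormalized bivariate normal with mean (1, 2) and correlation 0.5, and both proposals are symmetric random walks with standard deviation <code>b</code>, so the ratio of proposal densities cancels in r. The comparison is done on the log scale, which is equivalent to accepting with probability r.<br />

```matlab
% Metropolis-Hastings within Gibbs for an (unnormalized) bivariate normal
% target. Target, step size b and run length are illustrative assumptions.
rho = 0.5;
logf = @(x, y) -((x-1).^2 - 2*rho*(x-1).*(y-2) + (y-2).^2) / (2*(1-rho^2));
b = 1;                            % random-walk proposal standard deviation
n = 20000;
X = zeros(n, 1); Y = zeros(n, 1); % start at (0, 0)
for k = 1:n-1
    % MH step for X with Y fixed; the proposal is symmetric, so q cancels
    Z = X(k) + b*randn;
    if log(rand) < logf(Z, Y(k)) - logf(X(k), Y(k))
        X(k+1) = Z;
    else
        X(k+1) = X(k);
    end
    % MH step for Y with the NEW X fixed
    Z = Y(k) + b*randn;
    if log(rand) < logf(X(k+1), Z) - logf(X(k+1), Y(k))
        Y(k+1) = Z;
    else
        Y(k+1) = Y(k);
    end
end
mX = mean(X(2001:end));           % should be close to 1
mY = mean(Y(2001:end));           % should be close to 2
```

Unlike plain Gibbs sampling, some proposals are rejected here, so the chain repeats values.<br />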
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, most notably Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many important pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ d \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of page } i, \text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero (provided <math>\displaystyle d < 1</math>). The coefficient <math>\displaystyle d </math>, which lies between 0 and 1, weights the link-based sum against the constant term.<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
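<br />
As a small illustration of these definitions, the sketch below builds <math>L</math>, <math>c</math> and <math>D</math> for a hypothetical web of four pages (page 1 links to pages 2 and 3, page 2 links to 3, page 3 links to 1, and page 4 links to 1 and 3). The link structure is made up for the example.<br />

```matlab
% Hypothetical 4-page web: L(i,j) = 1 if page j links to page i.
L = [0 0 1 1;
     1 0 0 0;
     1 1 0 1;
     0 0 0 0];
c = sum(L, 1);                    % c(j) = number of outgoing links of page j
D = diag(c);
M = L / D;                        % L * D^{-1}: column j of L divided by c(j)
```

Each column of <code>M</code> sums to 1, because a page's vote is split equally among the pages it links to.<br />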
<br />
====Solving for P====<br />
Since rank is a relative term, we can fix its overall scale. Summing the defining equation over i (each column of <math>\displaystyle L \times D^{-1}</math> sums to 1 when every page has at least one outgoing link) shows that the normalization consistent with the formula above is <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). We can then solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known)} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector of A corresponding to eigenvalue 1. Secondly, since the entries of A are non-negative and each of its columns sums to 1, A is the transition matrix of a Markov chain, and <math>\displaystyle P/N</math> is its stationary distribution.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
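<br />
The iteration can be sketched in MATLAB for a hypothetical four-page web. Several choices here are assumptions made for the illustration: the link matrix is made up, <code>d = 0.85</code> is a commonly quoted value rather than one from the lecture, and A is taken as <math>(1-d) e e^T / N + d L D^{-1}</math>, the scaling under which every column of A sums to 1. The iterate is normalized to sum to 1, which only fixes the arbitrary scale of the eigenvector.<br />

```matlab
L = [0 0 1 1; 1 0 0 0; 1 1 0 1; 0 0 0 0];   % L(i,j) = 1 if page j links to i
N = size(L, 1);
d = 0.85;                                   % damping coefficient (assumed)
M = L / diag(sum(L, 1));                    % L * D^{-1}
A = (1 - d)/N * ones(N) + d*M;              % every column of A sums to 1
P = ones(N, 1)/N;                           % initial guess P_0
for it = 1:200
    P = A*P;                                % P_i <- A * P_{i-1}
end
[~, top] = max(P);                          % index of the highest-ranked page
```

In this example page 4 has no incoming links, so it keeps only the constant share <code>(1-d)/N</code>, while page 3, which receives links from every other page, is ranked highest.<br />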
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
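<br />
These definitions translate directly into MATLAB. The vectors below are arbitrary example values.<br />

```matlab
u = [3; 4; 0];
v = [4; 3; 0];
ip = u' * v;                      % inner product u . v = 24
nu = sqrt(u' * u);                % length of u = 5
dist = norm(u - v);               % distance between u and v = sqrt(2)
theta = acos(ip / (norm(u)*norm(v)));   % angle between u and v, in radians
```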
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
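<br />
Linear independence can be checked numerically by placing the vectors in the columns of a matrix and comparing its rank with the number of columns. The matrices below are arbitrary examples.<br />

```matlab
V1 = [1 0; 0 1; 1 1];             % two vectors in R^3 (as columns)
V2 = [1 0 1; 0 1 1];              % three vectors in R^2, so p > n
r1 = rank(V1);                    % 2 = number of columns -> independent
r2 = rank(V2);                    % 2 < 3 columns -> dependent
a = null(V2);                     % non-trivial coefficients with V2*a = 0
```

The column of <code>a</code> gives coefficients, not all zero, for which the combination of the three vectors in <code>V2</code> is the zero vector.<br />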
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent, <math>A^{\dagger} = (A^TA)^{-1}A^T</math> and <math>A^{\dagger}A = I</math>.<br />
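<br />
The following sketch checks two of the statements above on small example matrices (chosen arbitrarily): the trace equals the sum of the eigenvalues, and for a matrix with linearly independent columns the formula <math>(A^TA)^{-1}A^T</math> agrees with MATLAB's <code>pinv</code>.<br />

```matlab
A = [2 1; 1 3];
trA = trace(A);                   % sum of the diagonal entries = 5
lam = eig(A);                     % eigenvalues; their sum is also 5
B = [1 0; 0 1; 1 1];              % 3 x 2 matrix with independent columns
Bp = (B'*B) \ B';                 % (B^T B)^{-1} B^T
```

Here <code>Bp*B</code> is the 2 x 2 identity, while <code>B*Bp</code> is not the 3 x 3 identity: the pseudo-inverse of a tall matrix is only a left inverse.<br />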
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis of an n-dimensional space, it is necessary and sufficient that the n vectors <math>\{u_i\}</math> be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' if there exists a scalar <math>\lambda</math> (the corresponding eigenvalue) such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
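<br />
These properties can be verified numerically with <code>eig</code>. The matrix below is an arbitrary real, symmetric, positive-definite example whose characteristic equation is <math>\lambda^2 - 4\lambda + 3 = 0</math>.<br />

```matlab
A = [2 1; 1 2];                   % real, symmetric, positive-definite
[V, Lam] = eig(A);                % columns of V: eigenvectors; diag(Lam): eigenvalues
lam = diag(Lam);                  % 1 and 3, the roots of lambda^2 - 4*lambda + 3
res = norm(A*V - V*Lam);          % ~0, i.e. A v = lambda v for each pair
orth = V(:,1)' * V(:,2);          % ~0: the two eigenvectors are orthogonal
```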
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
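<br />
A rotation matrix gives a simple example of an orthonormal transformation. In the sketch below the angle and the vector are arbitrary choices.<br />

```matlab
t = pi/6;                         % rotation angle (assumed)
A = [cos(t) -sin(t); sin(t) cos(t)];
x = [3; 4];
y = A*x;                          % rotated vector
errI = norm(A'*A - eye(2));       % ~0, so A^T = A^{-1}
```

The length of <code>y</code> equals the length of <code>x</code> (both are 5), as the magnitude-preserving property states.<br />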
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
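<br />
The decorrelating effect of the eigenvector transformation can be seen on simulated data. This is a sketch using the sample covariance matrix in place of <math>\Sigma</math>; the mixing matrix used to generate correlated data is an arbitrary assumption.<br />

```matlab
n = 5000;
Z = randn(n, 2);
X = Z * [2 0; 1.5 0.5];           % correlated 2-D sample (assumed mixing matrix)
S = cov(X);                       % sample covariance matrix
[V, Lam] = eig(S);                % principal directions and their variances
Y = X * V;                        % coordinates in the eigenvector basis
SY = cov(Y);                      % diagonal up to rounding; diagonal = eigenvalues
```

The off-diagonal entries of <code>SY</code> vanish, and its diagonal entries are the eigenvalues of <code>S</code>.<br />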
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second Principal Component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
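<br />
The decomposition into a mean plus two components can be sketched in MATLAB. Since the digit images are not reproduced here, the sketch uses a synthetic stand-in matrix (an assumption) with 20 rows in place of the 644 pixels and 50 columns in place of the 103 images. It also uses the singular value decomposition of the centred data, whose left singular vectors are the eigenvectors of the sample covariance matrix.<br />

```matlab
% Synthetic stand-in for the image matrix (assumed): D = 20 "pixels",
% n = 50 "images" lying near a 2-dimensional subspace.
D = 20; n = 50;
X = randn(D, 2) * randn(2, n) + 0.01*randn(D, n);
xbar = mean(X, 2);                       % mean image
Xc = X - repmat(xbar, 1, n);             % centre each column
[U, ~, ~] = svd(Xc, 'econ');
V = U(:, 1:2);                           % first two principal components
W = V' * Xc;                             % 2 x n matrix of coefficients
Xhat = repmat(xbar, 1, n) + V*W;         % rank-2 reconstruction
relErr = norm(X - Xhat, 'fro') / norm(Xc, 'fro');
```

Each column of <code>W</code> holds the two coefficients <math>\lambda_1, \lambda_2</math> of one image, and <code>relErr</code> measures how much of the centred data the two components fail to capture.<br />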
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any <math>c > 0</math>, so without loss of generality we take <math>\textbf{w}</math> to have unit length,<br><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, so that <br />
<br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> obtained with any weight vector <math>\displaystyle w </math>.<br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint. The problem then becomes,<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is a maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, then there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point their tangents are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. To determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
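<br />
As a quick numerical check of the example above, the following Python sketch evaluates <math>f</math> at the two stationary points and compares them with a brute-force scan of the unit circle. The variable names and the grid resolution are illustrative choices, not part of the lecture.<br />
<br />
```python
import math

def f(x, y):
    """Objective from the example: f(x, y) = x - y."""
    return x - y

# Stationary points of the Lagrangian found analytically above.
r = math.sqrt(2) / 2
stationary = [(r, -r), (-r, r)]
values = [f(x, y) for x, y in stationary]

# Brute-force check: parametrise the constraint x^2 + y^2 = 1 by an angle t.
n_grid = 100000  # illustrative resolution
best_t = max((2 * math.pi * k / n_grid for k in range(n_grid)),
             key=lambda t: f(math.cos(t), math.sin(t)))
best_point = (math.cos(best_t), math.sin(best_t))

print(values)       # f at the two stationary points
print(best_point)   # grid maximiser on the unit circle
```
<br />
The grid maximiser should land next to the stationary point with the larger value of <math>f</math>, which is the constrained maximum.<br />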
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (<math> \textbf{w}^T \textbf{w} = 1 </math>) then the second part of the equation is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector then the second part of the equation is non-zero and acts as a penalty on <math>\displaystyle L(\textbf{w},\lambda)</math>. The constrained maximum is attained when <math> \textbf{w}^T \textbf{w} =1 </math>.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook].)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
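<br />
The relationship between the top eigenvalue and the variance of the projected data can be checked numerically. The sketch below is a plain-Python illustration on made-up 2-D data; it uses power iteration to find the leading eigenvector, which is one simple choice of method and not necessarily the one used in class.<br />
<br />
```python
import math

def covariance(data):
    """Sample covariance matrix (divisor n - 1) and the centred data."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - mean[j] for j in range(d)] for row in data]
    S = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
         for a in range(d)]
    return S, centred

def top_eigenpair(S, iters=500):
    """Power iteration: largest eigenvalue of S and a unit eigenvector."""
    d = len(S)
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        v = [sum(S[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    lam = sum(w[a] * S[a][b] * w[b] for a in range(d) for b in range(d))
    return lam, w

# Toy 2-D data (made up for illustration).
data = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
        [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]

S, centred = covariance(data)
lam1, w1 = top_eigenpair(S)

# Variance of the projections u = w1^T x; it should equal the top eigenvalue.
u = [sum(wi * xi for wi, xi in zip(w1, row)) for row in centred]
var_u = sum(x * x for x in u) / (len(u) - 1)

# In 2-D the remaining eigenvalue is Tr(S) - lam1.
lam2 = S[0][0] + S[1][1] - lam1
print(lam1, var_u, lam2)
```
<br />
The first two printed numbers should agree, and the two eigenvalues add up to the trace of the sample covariance matrix.<br />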
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
 who                                  % list the variables that were loaded<br />
 size(X)                              % each column of X holds one image (560 pixels)<br />
 imagesc(reshape(X(:,1),20,28)')      % display the first noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images. The first one shows the noisy data, in which we can barely make out the face.<br />
<br />
The second one is the denoised image.<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare the different types of data) or cluster it (i.e. leave the data unlabeled and look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
The Matlab code is as follows.<br />
 load 2_3 % loads X, which is 64 x 400 (one 8 x 8 image per column)<br />
 [coefs, scores] = princomp(X'); <br />
      % performs principal components analysis on the data matrix X'<br />
      % returns the principal component coefficients and scores<br />
      % scores is the low dimensional representation of the data X<br />
 plot(scores(:,1),scores(:,2)) <br />
      % plots the direction of largest variance on the x-axis <br />
      % and the second largest on the y-axis <br />
 plot(scores(1:200,1),scores(1:200,2))<br />
      % same graph as above, only with the 2s (not 3s)<br />
 hold on % this command allows us to add to the current plot<br />
 plot(scores(201:400,1),scores(201:400,2),'ro')<br />
      % this adds the data for the 3s<br />
      % the 'ro' option draws them as red circles on the plot<br />
      % If we classify based on the position in this plot (feature), <br />
      % it's easier than looking at each of the 64 pixels<br />
 gname() % displays a figure window and <br />
      % waits for you to press a mouse button or a keyboard key<br />
 figure<br />
 subplot(1,2,1)<br />
 imagesc(reshape(X(:,45),8,8)')<br />
 subplot(1,2,2)<br />
 imagesc(reshape(X(:,93),8,8)')<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y_{d \times n} = U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X}_{D \times n} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
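<br />
The encoding and reconstruction steps in the notes above can be sketched in a few lines of plain Python. The data values and the direction in <code>U</code> are made up for illustration (in practice <code>U</code> would hold the top ''d'' eigenvectors), and the helper names are hypothetical.<br />
<br />
```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

# X is D x n (one data point per column) and already has mean 0; values are made up.
X = [[-1.0, 0.0, 1.0, 2.0, -2.0],
     [-1.2, 0.1, 0.9, 2.1, -1.9]]

# U is D x d with orthonormal columns; here d = 1 and the direction is assumed.
s = 1 / math.sqrt(2)
U = [[s], [s]]

Y = matmul(transpose(U), X)   # encoding:       d x n
X_hat = matmul(U, Y)          # reconstruction: D x n

print(len(Y), len(Y[0]))          # 1 5
print(len(X_hat), len(X_hat[0]))  # 2 5
```
<br />
Each reconstructed column is the projection of the original column onto the line spanned by the column of <code>U</code>, so the part of the data orthogonal to that line is discarded.<br />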
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower dimensional space. The difference is that we are not interested in maximizing variances. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2</math> are the covariance matrices of the two classes, and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> : within classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data====<br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br />
Let <math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> : between classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
(1) <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
(2) <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br />
<math>\displaystyle [(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) ] / [(w^T(\Sigma_1 + \Sigma_2)w)] </math><br />
<br />
or<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)/(w^Ts_ww)</math><br />
<br />
<br />
<br />
Since <math>w</math> is a direction vector, its length does not matter: rescaling <math>w</math> leaves the ratio unchanged. We can therefore fix the denominator to 1 and solve the following constrained optimization problem using Lagrange multipliers<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
<br />Subject To: <br />
<math>\displaystyle w^T(\Sigma_1 + \Sigma_2)w=1 </math> or <math>\displaystyle w^Ts_ww=1</math> <br />
<br />
<br />
<br />
<br />
Therefore, the function that we want to maximize is<br />
<br />
<math>\displaystyle (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) - \lambda * [(w^T(\Sigma_1 + \Sigma_2)w)-1] </math><br />
<br />
or <br />
<br />
<math>\displaystyle (w^Ts_Bw) - \lambda * [(w^Ts_ww)-1] </math><br />
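<br />
Setting the derivative of this Lagrangian with respect to <math>w</math> to zero gives <math>s_Bw = \lambda s_ww</math>; because <math>s_Bw</math> always points along <math>\mu_1 - \mu_2</math>, the standard two-class solution is <math>w \propto s_w^{-1}(\mu_1 - \mu_2)</math>. This step is not derived in the notes above, so the following Python sketch should be read as an illustration under that assumption. The two classes are made-up 2-D points and the helper names are hypothetical.<br />
<br />
```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def class_cov(points, mu):
    """2 x 2 sample covariance matrix of one class (divisor n - 1)."""
    n = len(points)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        dx = [p[0] - mu[0], p[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                c[a][b] += dx[a] * dx[b] / (n - 1)
    return c

def quad_form(w, m):
    """w^T m w for a 2-vector w and a 2 x 2 matrix m."""
    return sum(w[a] * m[a][b] * w[b] for a in range(2) for b in range(2))

def fisher_ratio(w, s_b, s_w):
    """Objective (w^T s_B w) / (w^T s_w w)."""
    return quad_form(w, s_b) / quad_form(w, s_w)

# Two made-up classes in 2-D.
class1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 5.0], [5.0, 5.0]]
class2 = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [3.0, 2.0], [5.0, 3.0]]

mu1, mu2 = mean(class1), mean(class2)
c1, c2 = class_cov(class1, mu1), class_cov(class2, mu2)
s_w = [[c1[a][b] + c2[a][b] for b in range(2)] for a in range(2)]
diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
s_b = [[diff[a] * diff[b] for b in range(2)] for a in range(2)]

# w proportional to s_w^{-1} (mu1 - mu2), using the explicit 2 x 2 inverse.
det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
w = [(s_w[1][1] * diff[0] - s_w[0][1] * diff[1]) / det,
     (-s_w[1][0] * diff[0] + s_w[0][0] * diff[1]) / det]

print(w, fisher_ratio(w, s_b, s_w))
```
<br />
Evaluating <code>fisher_ratio</code> for other directions, such as the coordinate axes or the difference of the means itself, should never give a larger value than it does for <code>w</code>.<br />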
<br />
<br />
<br />
<br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Example: FDA<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we may wish to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image.<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> are called the ''training set.''<br />
<br />
Then <math>h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we estimate the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math>, where <math>\, I</math> is the indicator function.<br />
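<br />
The empirical error rate is a one-line computation. In the Python sketch below, the threshold classifier <code>h</code> and the test points are made up purely to illustrate the counting.<br />
<br />
```python
def error_rate(h, xs, ys):
    """Fraction of points (x_i, y_i) for which h(x_i) != y_i."""
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / len(xs)

# Hypothetical one-dimensional threshold classifier.
def h(x):
    return 1 if x >= 0.5 else 0

# Made-up test points and their true labels.
xs = [0.1, 0.4, 0.6, 0.9, 0.45, 0.55]
ys = [0, 0, 1, 1, 1, 0]

print(error_rate(h, xs, ys))   # 2 of the 6 points are misclassified
```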
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math>.<br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)Pr(Y=1)} {Pr(X=x)} </math><br />
<br />
where the denominator is <math>\, Pr(X=x) = Pr(X=x \mid Y=1)Pr(Y=1)+Pr(X=x \mid Y=0)Pr(Y=0) </math>.<br />
<br />
So our classifier function is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
This classifier, the Bayes classification rule, achieves the lowest possible error rate. <br />
<br />
A problem is that in practice we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>.<br />
<br />
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques.<br />
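<br />
When the distributions are known, which is rarely the case in practice, the rule can be applied directly. The Python sketch below uses a made-up feature taking three values, with assumed class-conditional probabilities and prior; all names and numbers are illustrative.<br />
<br />
```python
def posterior(x, lik1, lik0, prior1):
    """r(x) = Pr(Y=1 | X=x) via Bayes' rule for a two-class problem."""
    num = lik1[x] * prior1
    den = num + lik0[x] * (1 - prior1)
    return num / den

def bayes_classifier(x, lik1, lik0, prior1):
    """Predict 1 when r(x) >= 1/2 and 0 otherwise."""
    return 1 if posterior(x, lik1, lik0, prior1) >= 0.5 else 0

# Assumed distributions over a feature X taking values 'a', 'b', 'c'.
lik1 = {'a': 0.7, 'b': 0.2, 'c': 0.1}   # Pr(X=x | Y=1)
lik0 = {'a': 0.1, 'b': 0.3, 'c': 0.6}   # Pr(X=x | Y=0)
prior1 = 0.4                            # Pr(Y=1)

for x in ['a', 'b', 'c']:
    print(x, posterior(x, lik1, lik0, prior1),
          bayes_classifier(x, lik1, lik0, prior1))
```
<br />
With these numbers only the value 'a' has a posterior above one half, so it is the only value classified as 1.<br />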
<br />
One such technique is Decision Boundary<br />
<br />
=== Decision Boundary ===</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3346stat341 / CM 3612009-07-21T22:56:02Z<p>Hclam: /* Classification - July 21 (cont) */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges, since a computer follows deterministic instructions. In computational statistics we therefore use deterministic algorithms whose output appears random (pseudo-random), so that each possible number appears to be drawn from the desired distribution, such as the uniform distribution. Outside a computational setting, sampling from a uniform distribution is fairly easy (for example, rolling a fair die repeatedly to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, Park and Miller showed in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
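<br />
The recurrence is short enough to write out directly. The Python sketch below reproduces the two worked examples above and then uses the Park and Miller parameters; the function name <code>lcg</code> is hypothetical, and the scaling divides by ''m'' (rather than ''m'' - 1) so that the values stay strictly between 0 and 1.<br />
<br />
```python
def lcg(a, b, m, seed, n):
    """Return x_1, ..., x_n of the recurrence x_{i+1} = (a*x_i + b) mod m."""
    xs, x = [], seed
    for _ in range(n):
        x = (a * x + b) % m
        xs.append(x)
    return xs

print(lcg(13, 0, 31, 1, 3))   # [13, 14, 27]
print(lcg(3, 2, 4, 1, 3))     # [1, 1, 1]  (a poor choice of parameters)

# Park and Miller parameters, scaled to the unit interval.
m = 2**31 - 1
uniforms = [x / m for x in lcg(7**5, 0, m, 1, 5)]
print(uniforms)
```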
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative distribution function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, strictly increasing, and is the cumulative distribution function (cdf) for some distribution, then the random variable X has cdf F, i.e. <math>P(X \leq x) = F(x)</math>.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta t} dt = 1 - e^{-\theta x} </math><br />
<br />
:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as Unif[0, 1], we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1, \dots, x_n</math> from <math>f(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows an exponential pattern as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math> and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math>. Further, for some distributions it is not even possible to find <math>F(x)</math> (i.e. a closed form expression for the distribution function, or otherwise; even if the closed form expression exists, it's usually difficult to find <math>F^{-1}</math>).<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the upper bounds use less-than-or-equal-to signs rather than the strict less-than signs given in class, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000);            % preallocate the output<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2))   % cumulative probability 0.3 + 0.2<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />
<br />
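The same idea extends to any finite discrete distribution by walking through the cumulative probabilities. The following Python sketch is one possible generalisation of the example above; the function name <code>sample_discrete</code> and the fixed seed are illustrative choices.<br />
<br />
```python
import random
from itertools import accumulate

def sample_discrete(values, probs, u):
    """Return the value x_k whose cumulative-probability interval contains u."""
    for value, cum in zip(values, accumulate(probs)):
        if u <= cum:
            return value
    return values[-1]   # guards against rounding when u is very close to 1

values, probs = [0, 1, 2], [0.3, 0.2, 0.5]

random.seed(0)  # fixed seed only so that the run is repeatable
draws = [sample_discrete(values, probs, random.random()) for _ in range(100000)]
freqs = [draws.count(v) / len(draws) for v in values]
print(freqs)    # relative frequencies, to be compared with probs
```
<br />
The relative frequencies should be close to the probabilities 0.3, 0.2 and 0.5.<br />
<br />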
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c</math> such that <math>c \cdot g(x) \geq f(x)\ \forall x</math>, we can draw candidates from <math>g(x)</math> and accept each candidate <math>x</math> with probability <math> \frac {f(x)}{c \cdot g(x)} </math>. The accepted samples follow the target distribution <math>f(x)</math>: candidates where the ratio is close to 1 are almost always accepted, while candidates where the ratio is close to 0 are usually rejected.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (scaled proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is close to 1, so a candidate drawn there is almost always accepted.<br />
<br />
At x=9, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is smaller, so a candidate drawn there is rejected with probability <math> 1 - \frac {f(x)}{c \cdot g(x)} </math>.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
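<br />
The procedure above can be sketched as a generic Python function. As an illustration it is applied to the density <math>f(x) = 3x^2</math> on [0, 1] with a Unif[0, 1] proposal and <math>c = 3</math>; this target, the function name <code>accept_reject</code> and the fixed seed are assumptions made for the sketch, not part of the lecture.<br />
<br />
```python
import random

def accept_reject(f, g_pdf, g_sample, c, n, rng):
    """Draw n samples from density f using proposals from g, where c*g(x) >= f(x)."""
    samples, proposals = [], 0
    while len(samples) < n:
        y = g_sample(rng)          # step 1: candidate from the proposal
        u = rng.random()           # step 2: uniform draw
        proposals += 1
        if u <= f(y) / (c * g_pdf(y)):   # step 3: accept or reject
            samples.append(y)
    return samples, proposals

rng = random.Random(1)   # fixed seed only so that the run is repeatable
samples, proposals = accept_reject(
    f=lambda x: 3 * x * x,            # illustrative target density on [0, 1]
    g_pdf=lambda x: 1.0,              # Unif[0, 1] density
    g_sample=lambda r: r.random(),
    c=3.0, n=20000, rng=rng)

acceptance_rate = len(samples) / proposals
sample_mean = sum(samples) / len(samples)
print(acceptance_rate)   # expected to be near 1/c
print(sample_mean)       # expected to be near 3/4, the mean of this target
```
<br />
The observed acceptance rate gives an empirical check of the result <math>Pr(accept) = \frac{1}{c}</math> from the proof above.<br />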
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
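x = zeros(1000,1); % accepted samples<br />
i = 0; % number of samples accepted so far<br />
while i < 1000<br />
y = rand; % proposal Y ~ Unif[0,1]<br />
u = rand; % U ~ Unif[0,1]<br />
if u <= y % accept when U <= f(Y)/(c*g(Y)) = Y<br />
i = i + 1;<br />
x(i) = y;<br />
end<br />
end<br />
hist(x);<br />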
<br />
<br />
The histogram produced here follows the target density <math>f(x) = 2x</math> on the interval <math> 0 \leq x \leq 1 </math>, which is the PDF of the Beta(2,1) distribution, as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
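<br />
As a rough check on the proof above, the fraction of proposals that are accepted should be close to <math>Pr(accept) = 1/c = 0.5</math> in this example. The sketch below is only an illustration; the variable names and the number of proposals are our own choices, not part of the lecture.<br />
<br />
```matlab
% Sketch: empirical acceptance rate for the Beta(2,1) example (c = 2).
nProposals = 100000;          % number of proposals (arbitrary choice)
y = rand(nProposals, 1);      % proposals Y ~ Unif[0,1]
u = rand(nProposals, 1);      % U ~ Unif[0,1]
accepted = (u <= y);          % accept when U <= f(Y)/(c*g(Y)) = Y
x = y(accepted);              % accepted values follow f(x) = 2x
acceptRate = mean(accepted)   % should be near 1/c = 0.5
sampleMean = mean(x)          % should be near E[X] = 2/3 for Beta(2,1)
```
<br />
Vectorizing the proposals this way yields a random number of accepted samples (about half of the proposals here), whereas the loop above keeps drawing until a fixed number have been accepted.<br />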
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can take g(x) to be the discrete uniform distribution on the set of values that X may take in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
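<br />
The discrete procedure above can be sketched in MATLAB as follows. This is a minimal illustration under our own choices (the sample size N and the variable names are assumptions); it accepts a proposal Y when <math>U \leq f(Y)/(c \cdot g(Y))</math> and then compares the relative frequencies with the target pmf.<br />
<br />
```matlab
% Sketch: Acceptance/Rejection for the discrete pmf above.
f = [0.15 0.55 0.20 0.10];    % target pmf on {1,2,3,4}
g = 0.25;                     % discrete uniform proposal pmf
c = max(f) / g;               % c = 2.2
N = 10000;                    % number of accepted samples wanted (arbitrary)
x = zeros(N, 1);
k = 0;                        % number of samples accepted so far
while k < N
    y = floor(4 * rand) + 1;  % Y ~ Unif{1,2,3,4}
    u = rand;                 % U ~ Unif[0,1]
    if u <= f(y) / (c * g)    % accept with probability f(Y)/(c*g(Y))
        k = k + 1;
        x(k) = y;
    end
end
% Relative frequencies of the accepted values, to compare with f
freq = [mean(x == 1), mean(x == 2), mean(x == 3), mean(x == 4)]
```
<br />
On average only a fraction <math>1/c \approx 0.45</math> of the proposals is accepted, so the loop runs a little more than twice N times.<br />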
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, <math>c</math> must be large; since the acceptance probability is <math>1/c</math>, we may have to reject a large number of points before a sample of the desired size from the target distribution is obtained. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \sum_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
This suggests that if we want to sample from the Gamma distribution, we can draw t independent samples from the exponential distribution with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \sum_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
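<br />
The same idea works for a general rate <math>\lambda</math>. The sketch below is an illustration only: the value <math>\lambda = 2</math> and the variable names are our own assumptions. It uses the product form from step 3 and compares the sample mean and variance with the theoretical values <math>t/\lambda</math> and <math>t/\lambda^2</math>.<br />
<br />
```matlab
% Sketch: Gamma(t, lambda) samples from products of uniforms.
t = 5;                          % shape (number of exponentials summed)
lambda = 2;                     % rate (assumed value for illustration)
N = 10000;                      % number of Gamma samples
U = rand(N, t);                 % each row holds u_1, ..., u_t
X = -log(prod(U, 2)) / lambda;  % X = -(1/lambda) * log(u_1 * ... * u_t)
meanCheck = [mean(X), t/lambda]   % sample mean vs. theoretical mean
varCheck  = [var(X), t/lambda^2]  % sample variance vs. theoretical variance
```
<br />
For large t the product of uniforms can underflow, in which case summing the logarithms, as in the code above with <code>sum(-log(U), 2)</code>, is the safer form.<br />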
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far: the inverse of F(Z) has no closed form, and although the Acceptance-Rejection method could be used, there are better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller method works in polar coordinates. It uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is the distance from that point to the pole (the origin), and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, an Exponential density in d (with rate 1/2, i.e. mean 2) and a Uniform density in <math>\theta</math>, so d and <math>\theta</math> are independent<br />
<math> \begin{matrix} \Rightarrow d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
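<br />
Beyond the histograms, a quick numerical check is possible. The sketch below is our own addition, with assumed variable names; it repeats the Box-Muller steps and verifies that x and y have sample mean near 0, sample variance near 1, and sample correlation near 0, as expected for a pair of independent Standard Normal variables.<br />
<br />
```matlab
% Sketch: numerical sanity check of the Box-Muller output.
N = 5000;
u1 = rand(N, 1);
u2 = rand(N, 1);
r = sqrt(-2 * log(u1));        % r = sqrt(d), with d ~ Exp(rate 1/2)
theta = 2 * pi * u2;           % theta ~ Unif[0, 2*pi]
x = r .* cos(theta);
y = r .* sin(theta);
means = [mean(x), mean(y)]     % both should be near 0
vars  = [var(x), var(y)]       % both should be near 1
C = corrcoef(x, y);
rho = C(1, 2)                  % sample correlation, should be near 0
```
<br />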
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independent} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
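<br />
That both square roots give the same distribution can be checked numerically. The sketch below is an illustration with our own variable names; it draws the sample with each square root and compares the sample covariance with <math>\Sigma</math>.<br />
<br />
```matlab
% Sketch: both chol and sqrtm reproduce the covariance Sigma.
mu = [-2; 3];
Sigma = [1 1/2; 1/2 1];
N = 15000;
Z = randn(N, 2);                  % rows are independent N(0, I) vectors
R = chol(Sigma);                  % upper triangular, R' * R = Sigma
S = sqrtm(Sigma);                 % symmetric, S * S = Sigma
Xchol = ones(N, 1) * mu' + Z * R; % rows have covariance R' * R = Sigma
Xsqrt = ones(N, 1) * mu' + Z * S; % rows have covariance S' * S = Sigma
covChol = cov(Xchol)              % should be close to Sigma
covSqrt = cov(Xsqrt)              % should be close to Sigma
meanChol = mean(Xchol)            % should be close to mu'
```
<br />
Any matrix A with <math>A^T A = \Sigma</math> works in the row-vector form <code>Z*A</code>, which is why the two plots look alike.<br />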
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N -\frac{1}{2}log (2\pi) - log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta = \mu</math> is Gaussian (with <math>\sigma</math> treated as known):<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
:<math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean <math>\tilde{\mu}</math> differs from the MLE <math>\bar{x}</math>: it is a weighted average of the sample mean and the prior mean.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
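<br />
To make the comparison concrete, the sketch below computes the MLE and the posterior mean for simulated data. All numbers (true mean 3, <math>\sigma = 2</math>, <math>\mu_0 = 0</math>, <math>\tau = 1</math>, <math>N = 20</math>) and variable names are assumptions chosen for illustration, not values from the lecture.<br />
<br />
```matlab
% Sketch: MLE vs. posterior mean for a Gaussian mean with known sigma.
sigma = 2;                      % known standard deviation (assumed)
mu0 = 0;                        % prior mean (assumed)
tau = 1;                        % prior standard deviation (assumed)
N = 20;                         % sample size (assumed)
x = 3 + sigma * randn(N, 1);    % simulated data with true mean 3 (assumed)
xbar = mean(x)                  % maximum likelihood estimate of mu
% Weight on the data in the posterior mean: (N/sigma^2)/(N/sigma^2 + 1/tau^2)
wData = (N / sigma^2) / (N / sigma^2 + 1 / tau^2);
muTilde = wData * xbar + (1 - wData) * mu0   % posterior mean
```
<br />
With these values the weight on the data is 5/6, so the posterior mean is pulled one sixth of the way from <math>\bar{x}</math> towards <math>\mu_0</math>; increasing N moves the weight towards 1.<br />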
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)], \quad \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
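# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, i.e. <math>X \sim~ Exp(1)</math>, which the Inverse Transform Method gives as <math>x = -\log(u)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />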
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
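u=rand(100,1); % u ~ Unif(0,1)<br />
x=-log(u); % x ~ Exp(1) by the Inverse Transform Method<br />
h= x.^ .5; % h(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @ (x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value of the integral computed numerically by Matlab (the exact value is sqrt(pi)/2, about 0.8862)<br />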
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# Exact value, for comparison: <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1); % 1000 samples from Unif(0,1)<br />
mean(x.^3) % element-wise cube; should be close to 1/4<br />
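<br />
The standard error and confidence interval from the Process section can be added to this example. The sketch below is an illustration with our own variable names; it uses <math>Z_{\alpha/2} \approx 1.96</math> for a 95% interval.<br />
<br />
```matlab
% Sketch: Monte Carlo estimate of int_0^1 x^3 dx with standard error and CI.
N = 1000;
x = rand(N, 1);                 % x_i ~ Unif(0,1)
w = x.^3;                       % w(x) = h(x)*(b - a) with a = 0, b = 1
Ihat = mean(w)                  % Monte Carlo estimate of I = 1/4
SE = std(w) / sqrt(N)           % std uses the N-1 denominator, matching s
CI = [Ihat - 1.96 * SE, Ihat + 1.96 * SE]   % approximate 95% interval
```
<br />
Because the standard error shrinks like <math>1/\sqrt{N}</math>, halving the width of the interval requires about four times as many samples.<br />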
<br />
'''Example 3'''<br />
To estimate an infinite integral<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute in <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>; the limits <math>x=5</math> and <math>x \to \infty</math> become <math>y=1</math> and <math>y \to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
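<br />
This example can be sketched in MATLAB as below. The code is our own illustration: it uses the substitution <math>x = 1/y + 4</math>, <math>dx = -dy/y^2</math>, which after swapping the limits gives the integrand <math>3e^{-(1/y+4)}/y^2</math> on (0,1), and compares the estimate with the exact value <math>3e^{-5}</math>.<br />
<br />
```matlab
% Sketch: Monte Carlo estimate of int_5^inf 3*exp(-x) dx via y = 1/(x-4).
N = 100000;
y = rand(N, 1);                          % y_i ~ Unif(0,1)
w = 3 * exp(-(1 ./ y + 4)) ./ y.^2;      % transformed integrand on (0,1)
Ihat = mean(w)                           % Monte Carlo estimate
exact = 3 * exp(-5)                      % closed-form value of the integral
```
<br />
The transformed integrand is bounded on (0,1), so the estimator has finite variance even though the original range of integration is infinite.<br />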
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be flat (<math>\displaystyle f(p_1,p_2) = 1</math>) and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of p given x successes in n trials is proportional to the Binomial likelihood <math>\displaystyle p^x(1-p)^{n-x}</math>, which, viewed as a function of p, is the kernel of a Beta distribution. Therefore<br><br />
:<math>\displaystyle f(p_1 | X) \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle f(p_2 | Y) \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\% \text{ CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We assume a flat prior on each <math>\displaystyle p_i</math>, so that each posterior is a Beta distribution as in Example 1:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math><br />
# Keep sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard and try again.<br />
# Compute <math>\displaystyle \delta</math> as the smallest dose <math>\displaystyle x_i</math> whose sampled <math>\displaystyle p_i</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
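<br />
The Monte Carlo process above can be sketched as follows. Note that the death counts and dose levels in this code are hypothetical values invented for illustration; they are not the data from the lecture, which are not fully listed here. The sketch also skips the very rare ordered draw in which no <math>p_i</math> reaches 0.5, which is our own simplification. <code>betarnd</code> requires the Statistics Toolbox, as in Example 1.<br />
<br />
```matlab
% Sketch: Bayesian dose-response by Monte Carlo with an ordering constraint.
% HYPOTHETICAL data (not from the lecture): deaths out of n rats per dose.
n = 10;
y = [0 1 1 2 4 5 7 8 9 10];
dose = 1:10;                    % hypothetical dose levels x_1 < ... < x_10
N = 1000;                       % number of accepted draws wanted
delta = zeros(N, 1);
k = 0;                          % number of draws accepted so far
while k < N
    p = betarnd(y + 1, n - y + 1);           % one draw of (p_1, ..., p_10)
    % Keep only ordered draws; also skip draws where no p_i reaches 0.5
    if all(diff(p) >= 0) && p(end) >= 0.5
        k = k + 1;
        j = find(p >= 0.5, 1);               % first index with p_j >= 0.5
        delta(k) = dose(j);
    end
end
deltaBar = mean(delta)          % estimate of the dose with >= 50% mortality
```
<br />
Most draws violate the ordering and are discarded, so the loop typically needs many more than N iterations; this is the price of enforcing the constraint by rejection.<br />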
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution using samples drawn from a different one. As is the case with the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation of the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target function. Here we cast the integral <math>\int_{0}^1 g(x)dx</math> as an expectation with respect to U, such that <math>\int_{0}^1 g(x)dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x) and therefore <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the Law of Large Numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> at which the original density <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), is large relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we apply a regular Monte Carlo Simulation method to samples drawn from <math>\displaystyle g</math>, but weight each <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that the points which are more likely under <math>\displaystyle f</math> than under <math>\displaystyle g</math> (i.e. the ones for which <math>\displaystyle \frac{f(x)}{g(x)}</math> is high) count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of <math> h(x_i) </math> instead of a plain average.<br />
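<br />
As an illustration of the process, the sketch below estimates the tail probability <math>I = P(Z > 3)</math> for <math>Z \sim~ N(0,1)</math>, i.e. <math>h(x) = 1</math> for <math>x > 3</math> and 0 otherwise, with <math>f</math> the Standard Normal density. The choice of example and of the proposal, a shifted exponential <math>g(x) = e^{-(x-3)}</math> on <math>x \geq 3</math>, is our own assumption and not from the lecture.<br />
<br />
```matlab
% Sketch: importance sampling estimate of I = P(Z > 3), Z ~ N(0,1).
N = 100000;
u = rand(N, 1);
x = 3 - log(u);                        % x ~ g, where g(x) = exp(-(x-3)), x >= 3
f = exp(-x.^2 / 2) / sqrt(2 * pi);     % target density N(0,1) evaluated at x
g = exp(-(x - 3));                     % proposal density evaluated at x
w = f ./ g;                            % w = h*f/g; h(x) = 1 since every x > 3
Ihat = mean(w)                         % importance sampling estimate
exact = 0.5 * erfc(3 / sqrt(2))        % P(Z > 3) via the complementary error function
```
<br />
Plain Monte Carlo with samples from N(0,1) would see a value above 3 only about once in 740 draws, whereas here every sample falls in the region that matters.<br />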
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f > g</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a suitable choice of g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x)</math> does not vanish correspondingly fast, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, since then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> could be infinitely large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
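<br />
A small worked case, chosen by us for illustration, makes the tail condition concrete. Take <math>h(x) = 1</math>, <math>f(x) = e^{-x}</math> and the proposal <math>g(x) = \lambda e^{-\lambda x}</math> on <math>x \geq 0</math>. Then <math>E[(w(x))^2] = \int_0^\infty \frac{e^{-2x}}{\lambda e^{-\lambda x}} dx = \frac{1}{\lambda(2-\lambda)}</math> for <math>0 < \lambda < 2</math>, and the integral diverges for <math>\lambda \geq 2</math>, where g has a much thinner tail than f. The sketch below compares the sample second moment with this formula for two safe values of <math>\lambda</math> and prints the unstable result for <math>\lambda = 3</math>.<br />
<br />
```matlab
% Sketch: second moment of the importance weights for f = Exp(1), g = Exp(lambda).
N = 100000;
lambdas = [0.5 1.2 3];          % proposal rates (assumed values); 3 is too thin-tailed
m2 = zeros(size(lambdas));      % sample second moments of w
Ihat = zeros(size(lambdas));    % estimates of I = int f(x) dx = 1
for k = 1:numel(lambdas)
    lambda = lambdas(k);
    x = -log(rand(N, 1)) / lambda;                  % x ~ g = Exp(rate lambda)
    w = exp(-x) ./ (lambda * exp(-lambda * x));     % w = h*f/g with h = 1
    Ihat(k) = mean(w);
    m2(k) = mean(w.^2);
    fprintf('lambda = %.1f: Ihat = %.4f, sample E[w^2] = %.4g\n', ...
            lambda, Ihat(k), m2(k));
end
theory = 1 ./ (lambdas(1:2) .* (2 - lambdas(1:2)))  % closed form for lambda < 2
```
<br />
For <math>\lambda = 3</math> the printed second moment changes wildly from run to run, which is the practical symptom of an infinite variance.<br />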
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g </math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
::::::::::<math>\displaystyle ======Aside: Jensen's Inequality====== </math><br />
::<br />
If <math>\displaystyle g </math> is a convex function ( twice differentiable and <math>\displaystyle g''(x) \geq 0 </math> ) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to higher dimensions:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
::::::::::=======================================================<br />
<br />
=====Proof (cont)=====<br />
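Using Jensen's Inequality with probabilities <math>\displaystyle p_i</math> as the weights, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... + p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore, taking the convex function <math>\displaystyle u \mapsto u^2</math> and the random variable <math>\displaystyle \left| w(x) \right|</math>,<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />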
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
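Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E[(w(x))^2]</math>, the bound is attained: with <math>\displaystyle g^*</math> the ratio <math>\displaystyle \left| w(x) \right| = \frac{\left| h(x) \right| f(x)}{g^*(x)} = \int \left| h(s) \right| f(s) ds</math> is constant, so <math>\displaystyle E[(w(x))^2] = \left(\int \left| h(s) \right| f(s) ds \right)^2</math>.<br /><br />
<br />
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, because its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />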
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
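<br />
The theorem can be checked numerically on a small discrete problem. The following is only a sketch: the pmf <code>f</code>, the function <code>h</code> and the uniform alternative <code>g_unif</code> are made-up values chosen for illustration, not taken from the lecture.<br />
<br />
```matlab
% Discrete illustration (hypothetical numbers) of the minimum-variance choice of g.
f = [0.4 0.3 0.15 0.1 0.05];        % target pmf f on five points (assumed)
h = [0 0 1 2 4];                    % function h on the same points (assumed)

I      = sum(h .* f);               % exact value of E_f[h(X)]
g_star = abs(h) .* f / sum(abs(h) .* f);   % optimal g*, proportional to |h| f
g_unif = ones(1, 5) / 5;            % a naive alternative: uniform proposal

% Var[w] = E_g[w^2] - I^2 with w = h f / g. Points where g is 0 are skipped;
% there h f is 0 as well, so they contribute nothing.
second_moment = @(g) sum((h(g > 0) .* f(g > 0)).^2 ./ g(g > 0));
var_star = second_moment(g_star) - I^2;
var_unif = second_moment(g_unif) - I^2;

fprintf('Var under g*: %g, Var under uniform g: %g\n', var_star, var_unif);
```
<br />
Because <code>h</code> is non-negative in this sketch, the weight under <code>g_star</code> is constant and its variance is zero up to rounding, while the uniform proposal gives a strictly positive variance.<br />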
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \ h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and the <math>\displaystyle x_i</math> are drawn from <math>\displaystyle g</math><br />
Since <math>\displaystyle f</math> is a density, <math>\displaystyle \int f(x) dx = 1</math>, so we can also write<br />
:: <math>I =\displaystyle \frac{\int\ h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int\ h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{k=1}^{N} b(x_k)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{k=1}^{N} b(x_k)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant, because any constant factor in f cancels in the normalized weights <math>\displaystyle b'(x_i)</math>. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
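<br />
A minimal sketch of this normalized form, under assumptions made only for illustration: the target <code>f</code> is the standard normal density with its constant deliberately left out, the proposal <code>g</code> is <math>N(0, 2^2)</math>, and <math>h(x) = x^2</math>, so the true value is <math>E[X^2] = 1</math>.<br />
<br />
```matlab
% Self-normalized importance sampling with an unnormalized target.
% Assumptions for illustration: target f is N(0,1) known only up to a constant,
% proposal g is N(0, 2^2), and h(x) = x^2, so the true value is E[X^2] = 1.
N  = 100000;
x  = 2 * randn(N, 1);                 % draws from the proposal g = N(0, 4)
f_unnorm = exp(-x.^2 / 2);            % target density without 1/sqrt(2*pi)
g_unnorm = exp(-x.^2 / 8);            % proposal density without its constant
b  = f_unnorm ./ g_unnorm;            % unnormalized weights b(x_i)
bp = b / sum(b);                      % normalized weights b'(x_i), summing to 1
I_hat = sum(bp .* x.^2);              % estimate of E_f[h(X)]
fprintf('Self-normalized estimate of E[X^2]: %g\n', I_hat);
```
<br />
Neither normalizing constant is needed: both cancel when the weights are divided by their sum.<br />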
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle Pr (Z>3) </math> when <math>Z \sim N(0,1) </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
:so that, with <math>f</math> the standard normal density,<br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx \approx 0.0013 </math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_1, \dots, x_N</math> are drawn from <math>N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to count the proportion of observations that are greater than 3.<br />
<br />
The variability will be high in this case since this is a low-probability event (a tail event), and different runs may give significantly different values. For example, if we generate 1000 samples, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (an estimated probability of .005), etc.<br />
<br />
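This example can be illustrated in Matlab using the code below (we repeat the experiment 100 times, generating 100 samples in each run):<br />
<br />
 format long<br />
 a = zeros(1,100);                      % estimate from each run<br />
 for i = 1:100<br />
     a(i) = sum(randn(100,1)>=3)/100;   % proportion of 100 N(0,1) draws that are >= 3<br />
 end<br />
 meanMC = mean(a)                       % average of the 100 estimates<br />
 varMC = var(a)                         % variance of the estimates across runs<br />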
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 \times 10^{-5} </math>.<br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is the standard normal density, the <math>x_i</math> are drawn from <math>g(x)</math>, and <math>g(x)</math> needs to be chosen wisely so that it is similar in shape to <math>\left| h(x) \right| f(x)</math>.<br />
<br />
:Let <math>g</math> be the <math>N(4,1) </math> density, so that<br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
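This example can be illustrated in Matlab using the code below:<br />
 N = 100;                        % samples per run<br />
 I = zeros(1,100);               % estimate from each run<br />
 for j = 1:100<br />
     x = randn(N,1) + 4;         % draw from the proposal g = N(4,1)<br />
     w = zeros(N,1);<br />
     for ii = 1:N<br />
         h = x(ii)>=3;           % indicator h(x)<br />
         b = exp(8-4*x(ii));     % weight b(x) = f(x)/g(x)<br />
         w(ii) = h*b;<br />
     end<br />
     I(j) = sum(w)/N;            % importance sampling estimate for this run<br />
 end<br />
 MEAN = mean(I)<br />
 VAR = var(I)<br />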
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and much less than the variability observed when using Monte Carlo.<br />
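<br />
For comparison, the tail probability can also be computed directly from the complementary error function. The sketch below does that and repeats the importance sampling estimate in vectorized form; the sample size <code>N</code> is an arbitrary choice for illustration.<br />
<br />
```matlab
% Reference value for Pr(Z > 3) and a vectorized importance sampling estimate.
% erfc is the complementary error function: Pr(Z > z) = 0.5*erfc(z/sqrt(2)).
exact = 0.5 * erfc(3 / sqrt(2));
N = 100000;                              % sample size (chosen for illustration)
x = randn(N, 1) + 4;                     % draws from the proposal g = N(4,1)
I_is = mean((x >= 3) .* exp(8 - 4*x));   % average of h(x_i) * b(x_i)
fprintf('exact = %.6f, importance sampling = %.6f\n', exact, I_is);
```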
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...,X_N</math> is a Markov Chain with stationary distribution <math>f(x)</math>, then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(X_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set from which <math>\displaystyle t</math> takes its values. It could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...,X_n</math>, then<br />
:<math>\displaystyle f(X_1,...,X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...,X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the whole past, <math>\displaystyle X_t</math> depends only on <math>\displaystyle X_{t-1}</math>.<br />
<br />
In that case the factorization above simplifies to<br />
:<math>\displaystyle f(X_1,...,X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{n+1}=j\mid X_n= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For <math>N</math> states, the transition matrix <math>P</math> is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way, for example by a coin toss. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below; each interior row <math>i</math> has <math>q</math> in column <math>i-1</math> and <math>p</math> in column <math>i+1</math>:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\ddots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
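<br />
The random walk with absorbing end states can be sketched in Matlab as follows. The number of states <code>N</code>, the probability <code>p</code> and the starting state are illustrative values, not taken from the lecture.<br />
<br />
```matlab
% Transition matrix of a random walk on states 1..N with absorbing end states.
% N and p are illustrative values, not taken from the lecture.
N = 6;  p = 0.3;  q = 1 - p;
P = zeros(N);
P(1,1) = 1;  P(N,N) = 1;          % first and last states: stop once reached
for i = 2:N-1
    P(i,i-1) = q;                 % step to the left
    P(i,i+1) = p;                 % step to the right
end
disp(P);
disp(sum(P, 2)');                 % every row is a pmf, so each row sums to 1

% Simulate one path started in an interior state until it is absorbed.
state = 3;  path = state;
while state ~= 1 && state ~= N
    if rand < p
        state = state + 1;
    else
        state = state - 1;
    end
    path(end+1) = state;
end
disp(path);
```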
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that an estimator analogous to the basic Monte Carlo estimator can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math>, summed over all <math>j</math>, add to 1: row <math>i</math> must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i),..]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>,<br /><br />
then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
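Taking <math>\frac{}{}\mu_0</math> to be a point mass at state <math>i</math> shows that row <math>i</math> of <math>P(n)</math> equals row <math>i</math> of <math>P^n</math>, so the n-step transition matrix is <math>\frac{}{}P(n)=P^n</math><br />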
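<br />
The relation <math>\mu_n=\mu_0P^n</math> is easy to check numerically. In the sketch below the two-state matrix <code>P</code>, the initial distribution <code>mu0</code> and the number of steps <code>n</code> are hypothetical values chosen for illustration.<br />
<br />
```matlab
% Marginal distribution after n steps: mu_n = mu_0 * P^n.
% P and mu0 are small hypothetical examples.
P   = [0.9 0.1; 0.5 0.5];        % two-state transition matrix (rows sum to 1)
mu0 = [1 0];                     % start in state 1 with probability 1
n   = 3;

mu = mu0;
for k = 1:n
    mu = mu * P;                 % one step: mu_k = mu_{k-1} * P
end
mu_direct = mu0 * P^n;           % same result via the n-step matrix P^n

disp(mu);
disp(mu_direct);
```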
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Let <math>i,j</math> be states in the State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math>, then we say <math>i</math> reaches <math>j</math>, denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only move to each other or stay where they are (rows 1 and 2 are zero outside columns 1 and 2), so they communicate with each other and lead to no other state<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
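<br />
The same decomposition can be found mechanically from the transition matrix. The sketch below is one possible approach (a reachability matrix built by repeated squaring), not a method given in the lecture; the variable names are hypothetical.<br />
<br />
```matlab
% Communicating classes of the 4-state example, found from reachability.
P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];
N = size(P, 1);

% R(i,j) is true when j can be reached from i in zero or more steps.
R = (eye(N) + P) > 0;
for k = 1:N
    R = (double(R) * double(R)) > 0;   % repeated squaring extends path lengths
end

C = R & R';                       % C(i,j) true when i and j communicate
labels = zeros(1, N);
next = 0;
for i = 1:N
    if labels(i) == 0
        next = next + 1;
        labels(C(i,:)) = next;    % all states communicating with i share a label
    end
end
disp(labels);                     % class label of each state
```
<br />
States 1 and 2 receive the same label, while states 3 and 4 each get a label of their own, matching the three classes listed above.<br />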
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability of coming back to state i, starting from state i, is 1.<br />
<br />
If this is not the case (i.e. the probability is less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the others will be transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. We can be back at <math>\displaystyle 0</math> only after an even number of steps <math>\displaystyle 2n</math>, and only if exactly <math>\displaystyle n</math> of those steps were to the right and <math>\displaystyle n</math> were to the left:<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)= {2n \choose n} p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability could therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle 4pq \leq 1</math> with equality only when <math>\displaystyle p = q = \frac{1}{2}</math>, we can conclude that if <math>\displaystyle 4pq < 1</math> the series converges and the state is transient; otherwise <math>\displaystyle 4pq = 1</math>, the terms behave like <math>\displaystyle 1/\sqrt{n\pi}</math>, the series diverges, and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
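<br />
The two cases can be seen numerically by summing the return probabilities. This is only a sketch: the number of terms <code>M</code> and the value <math>p = 0.6</math> are arbitrary choices for illustration.<br />
<br />
```matlab
% Partial sums of P_00(2n) = nchoosek(2n,n) * (p*q)^n for the random walk.
% Terms are built with the ratio term(n)/term(n-1) = 2*(2n-1)/n * p*q,
% which avoids computing large binomial coefficients directly.
M = 2000;                                   % number of terms (illustrative)
partial = zeros(1, 2);
ps = [0.5 0.6];
for k = 1:2
    p = ps(k);  q = 1 - p;
    term = 1;  s = 1;                       % n = 0 term is 1
    for n = 1:M
        term = term * 2 * (2*n - 1) / n * p * q;
        s = s + term;
    end
    partial(k) = s;
end
closed_form = 1 / sqrt(1 - 4 * 0.6 * 0.4);  % limit of the series for p = 0.6
fprintf('p = 0.5: partial sum = %g (keeps growing with M)\n', partial(1));
fprintf('p = 0.6: partial sum = %g, closed form = %g\n', partial(2), closed_form);
```
<br />
For <math>p = 0.6</math> the partial sum settles at the closed-form value, while for <math>p = 0.5</math> it keeps increasing as more terms are added.<br />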
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i, in which case <math>\displaystyle T_{ij}=\infty</math>.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n \geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists, given } x_{0}=i \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> of state i is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} n f_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite-state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
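A state has period <math>\displaystyle d</math> if returns to it are possible only after a multiple of <math>\displaystyle d</math> steps, i.e. <math>\displaystyle d</math> is the greatest common divisor of all <math>\displaystyle n</math> with <math>\displaystyle P_{ii}(n) > 0</math>. A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle d</math> if its states have period <math>\displaystyle d > 1</math>.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />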
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: from any state, a return to that state is possible only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
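<br />
The period of a state can be estimated as the greatest common divisor of the step counts at which a return has positive probability. The sketch below applies this to state 1 of the two example chains; the cutoff <code>nmax</code> is an arbitrary illustrative choice.<br />
<br />
```matlab
% Period of a state as the gcd of all n <= nmax with (P^n)(i,i) > 0.
% nmax is an arbitrary illustrative cutoff.
P3 = [0 1 0; 0 0 1; 1 0 0];
P4 = [0 0.5 0 0.5; 0.5 0 0.5 0; 0 0.5 0 0.5; 0.5 0 0.5 0];
chains = {P3, P4};
nmax = 20;
d = zeros(1, 2);
for c = 1:2
    P  = chains{c};
    Pn = eye(size(P, 1));
    for n = 1:nmax
        Pn = Pn * P;                 % Pn now holds P^n
        if Pn(1,1) > 0
            d(c) = gcd(d(c), n);     % gcd(0, n) = n starts the running gcd
        end
    end
end
disp(d);                             % periods of state 1 in the two chains
```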
<br />
<br />
===Markov Chains and their Stationary Distributions - June 16, 2009===<br />
====New Definition: Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf).<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math> for all states <math>i</math> and <math>j</math>.<br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math>, since each row of <math>P</math> sums to 1, as required.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
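<br />
This warning can be checked directly in Matlab. The sketch below verifies that the uniform vector is stationary for the 3-cycle chain and prints a few powers of <code>P</code>; the name <code>pi_vec</code> is used instead of <code>pi</code> only to avoid shadowing Matlab's built-in constant.<br />
<br />
```matlab
% The 3-cycle chain: (1/3,1/3,1/3) is stationary, but P^n does not converge.
P  = [0 1 0; 0 0 1; 1 0 0];
pi_vec = [1/3 1/3 1/3];
disp(pi_vec * P);        % equals pi_vec, so pi_vec is stationary
disp(P^10);              % powers of P cycle through three different matrices
disp(P^11);
disp(P^12);              % P^12 is the identity again, since P^3 = I
```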
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \ 0 \ 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N \to \infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = \frac{1}{2}\pi_0 + \frac{1}{2}\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = \frac{1}{2}\pi_0 + \frac{1}{4}\pi_1 + \frac{1}{3}\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = \frac{1}{4}\pi_1 + \frac{2}{3}\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = \frac{3}{4}\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_1 + \pi_1 + \frac{3}{4}\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
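<br />
The hand calculation can be cross-checked numerically. The sketch below solves <math>\pi = \pi P</math> together with the normalization constraint as one linear system; this is one possible approach, and the names <code>A</code>, <code>rhs</code> and <code>pi_vec</code> are hypothetical.<br />
<br />
```matlab
% Stationary distribution of the 3-state example by solving pi = pi*P
% together with sum(pi) = 1.
P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];
A = [P' - eye(3); ones(1, 3)];     % (P' - I)*pi' = 0 and sum(pi) = 1
rhs = [0; 0; 0; 1];
pi_vec = (A \ rhs)';               % least-squares solve of the consistent system
disp(pi_vec);
disp([4 4 3] / 11);                % value derived by hand above
P50 = P^50;
disp(P50);                         % each row approaches pi_vec
```
<br />
Raising <code>P</code> to a high power shows the limiting distribution as well: every row of <code>P50</code> is close to <code>pi_vec</code>.<br />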
<br />
===Monte Carlo using Markov Chain - June 18, 2009===<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(X_i)\longrightarrow E_f(h(X))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>; this is the starting point of the chain. Choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br />2. Draw <math>Y \sim q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=\min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br />4. Draw <math> U \sim Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant and <math>x</math> is the current state of the chain, i.e. the proposal distribution is a normal centered at the current state.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=\min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of the Metropolis-Hastings algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the probability density of setting the mean to x and sampling y is equal to that of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> usually lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> would be very small <math>\Rightarrow</math> <math>r=\min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: the shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
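<br />
The effect of b can also be checked numerically by recording how often proposals are accepted. The following is a minimal Python sketch of the same sampler (the lecture code is in MATLAB; the function name <code>cauchy_mh_acceptance</code>, the chain length and the fixed seed are illustrative assumptions, not part of the lecture):<br />

```python
import random

def cauchy_mh_acceptance(b, n=10000, seed=0):
    """Random-walk Metropolis-Hastings for the standard Cauchy target.

    Returns the fraction of proposals accepted when the proposal
    is N(x, b^2), mirroring the MATLAB loop above.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)          # X(1) = randn
    accepted = 0
    for _ in range(n - 1):
        y = rng.gauss(x, b)          # Y = b*randn + X(i-1)
        r = min((1 + x * x) / (1 + y * y), 1.0)
        if rng.random() < r:         # accept Y
            x = y
            accepted += 1
    return accepted / (n - 1)

for b in (0.001, 2, 1000):
    print(b, cauchy_mh_acceptance(b))
```

A very small b should give an acceptance rate close to 1 (tiny steps are almost always accepted, but the chain barely moves), while a very large b should give an acceptance rate close to 0 (the chain gets stuck).<br />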
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have discussed <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br /> We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution, i.e. <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the detailed balance condition as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if detailed balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
Starting from the right-hand side of the stationarity condition and applying detailed balance inside the integral:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math><br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where the product equals 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br /><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br /><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br /><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired, which shows that Metropolis-Hastings satisfies detailed balance. <br />
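<br />
The argument can be checked numerically in the discrete case. The sketch below (Python, with a hypothetical three-state target <code>pi</code> and a hypothetical asymmetric proposal matrix <code>q</code> chosen only for illustration) builds the Metropolis-Hastings transition matrix and tests both detailed balance and stationarity:<br />

```python
def mh_transition_matrix(pi, q):
    """Build the Metropolis-Hastings transition matrix on a finite state space.

    pi : target probabilities (may be unnormalized)
    q  : proposal matrix, q[i][j] = probability of proposing j from i
    """
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and q[i][j] > 0:
                r = min(1.0, (pi[j] * q[j][i]) / (pi[i] * q[i][j]))
                P[i][j] = q[i][j] * r          # propose j, then accept it
        P[i][i] = 1.0 - sum(P[i])              # rejected proposals stay at i
    return P

pi = [0.2, 0.5, 0.3]                 # hypothetical target distribution
q = [[0.1, 0.6, 0.3],                # hypothetical asymmetric proposal
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]
P = mh_transition_matrix(pi, q)

# Detailed balance: pi_i P_ij == pi_j P_ji for every pair of states.
balanced = all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < 1e-12
               for i in range(3) for j in range(3))
# Stationarity: (pi P)_j == pi_j.
pi_P = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
print(balanced, pi_P)
```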
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999), whose Metropolis algorithm also underlies the Simulated Annealing method (introduced in this lecture as well), and W. K. Hastings, who generalized it. The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm; it is more efficient, although less widely applicable.<br />
<br />
In the last class, we showed that Metropolis-Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br /><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is a stationary distribution of the chain.<br />
<br />
But this is not enough; we also want the chain to converge to the stationary distribution.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This holds trivially for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed).<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>, i.e. <math>\epsilon = Y-X_i \sim~ g </math>. Here <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math>. <br />
:<br />2. This means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of the distance between the current state and the state the chain is going to travel to, i.e. it is of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br />3. Accept <math>\displaystyle Y</math> with probability <math>r=\min\{\frac{f(y)}{f(x)},1\}</math> (the <math>\displaystyle q</math> terms cancel because <math>\displaystyle q</math> is symmetric).<br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math> (it is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known, but the normalizing constant is not).<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
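<br />
The dependence can be seen directly in a simulation: every rejection repeats the previous state. The following Python sketch assumes, purely for illustration, a standard normal target (known up to a constant) and a fixed <math>N(0, 2^2)</math> proposal; neither choice comes from the lecture:<br />

```python
import math
import random

def independence_mh(n=20000, seed=1):
    """Independence Metropolis-Hastings for a standard normal target.

    Assumed setup (not from the lecture): f(x) is proportional to
    exp(-x^2/2) and the fixed proposal q is N(0, 2^2).
    """
    rng = random.Random(seed)
    f = lambda x: math.exp(-0.5 * x * x)             # target, up to a constant
    q = lambda x: math.exp(-0.5 * (x / 2.0) ** 2)    # proposal, up to a constant
    x = 0.0
    chain = [x]
    for _ in range(n - 1):
        y = rng.gauss(0.0, 2.0)                      # drawn without looking at x
        r = min(f(y) / f(x) * q(x) / q(y), 1.0)      # but r still depends on x
        if rng.random() < r:
            x = y
        chain.append(x)
    return chain

chain = independence_mh()
repeats = sum(1 for a, b in zip(chain, chain[1:]) if a == b)
print("fraction of steps that repeat the previous state:",
      repeats / (len(chain) - 1))
```

A truly independent sample from a continuous distribution would essentially never repeat a value, so a non-zero fraction of repeats shows that consecutive states are dependent.<br />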
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find the x that minimizes <math>\displaystyle h(x) </math>. This is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant <math>\displaystyle T > 0</math> (since the exponential function is monotone increasing, <math>\displaystyle e^{\frac{-h(x)}{T}}</math> is largest exactly where <math>\displaystyle h(x)</math> is smallest).<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, increment <math>\displaystyle i</math>, and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
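<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math>. Because T is large, the exponent is close to 0, so <math>\displaystyle r </math> is close to 1 and moves to worse points are still accepted often.<br />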
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = 0 </math> and we therefore reject <math>\displaystyle y </math><br />
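<br />
To make the two regimes concrete, the acceptance probability can be evaluated for a single uphill move at several temperatures. This is a small Python sketch; the function name <code>sa_accept_prob</code> and the three temperatures are illustrative assumptions:<br />

```python
import math

def sa_accept_prob(h_x, h_y, T):
    """Acceptance probability r = min(exp((h(x) - h(y)) / T), 1)."""
    if h_y <= h_x:
        return 1.0                       # downhill moves are always accepted
    return math.exp((h_x - h_y) / T)     # uphill moves: shrinks as T -> 0

# An uphill move with h(y) - h(x) = 1 at three hypothetical temperatures.
for T in (100.0, 1.0, 0.01):
    print(T, sa_accept_prob(0.0, 1.0, T))
```

For the same uphill move, the probability falls from nearly 1 at the highest temperature to essentially 0 at the lowest, which is why the chain explores freely early on and only moves downhill once T is small.<br />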
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle \frac{f(x)}{T}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
In the end, once T is quite small, the distribution that we are sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
% objective function (save in its own file, obj.m)<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
% main script<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); <br />
r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) ); %same ratio, written as one exponential to avoid 0/0 when T is small<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001</math> in the order displayed (continuing the chain for another block of iterations at each new value of T), then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
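<br />
The whole cooling schedule can be run in a single loop. The following is a Python sketch of the same procedure (the lecture code is in MATLAB); the function name <code>anneal</code>, the 100 iterations per temperature and the fixed seed are assumptions made for illustration:<br />

```python
import math
import random

def anneal(h, x0, temps, b=2.0, steps=100, seed=0):
    """Simulated annealing with a N(x, b^2) proposal.

    Runs `steps` Metropolis-Hastings iterations at each temperature
    in `temps`, carrying the current state from one temperature to the next.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for T in temps:
        for _ in range(steps):
            y = rng.gauss(x, b)
            # exp of the difference avoids dividing two tiny numbers
            r = 1.0 if h(y) <= h(x) else math.exp((h(x) - h(y)) / T)
            if rng.random() < r:
                x = y
            path.append(x)
    return path

h = lambda x: (x - 2.0) ** 2
temps = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
path = anneal(h, 0.0, temps)
print("final state:", path[-1])
```

Plotting <code>path</code> should show wide excursions at the high temperatures and a trace that settles near the minimizer as T decreases.<br />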
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The procedure for proving this balance equation is similar to what was done in the Metropolis-Hastings proof.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively small correlations between the random variables.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in" (or "burning") time. For this reason the distribution will be sampled better using only some of the last <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
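Y = [1 ; 2]; % mean vector mu<br />
rho = 0.01; % correlation between the two components<br />
sigma = sqrt(1 - rho^2); % conditional standard deviation<br />
X(1,:) = [0 0]; % starting point<br />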
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2));<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); % condition on the newly drawn X(i,1), not X(i-1,1)<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the points from iteration 1000 onward -> <br />
% this demonstrates that convergence occurs after a while <br />
% (this is called the burning time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
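<br />
The sampler can be checked by comparing the sample correlation with <math>\displaystyle \rho</math>. The following Python sketch follows the two conditional distributions above; the value <math>\displaystyle \rho = 0.8</math>, the chain length, the burn-in length and the function names are assumptions chosen so that the correlation is clearly visible (the MATLAB code above uses <math>\displaystyle \rho = 0.01</math>):<br />

```python
import math
import random

def gibbs_bivariate_normal(mu, rho, n=20000, burn_in=1000, seed=0):
    """Gibbs sampler for a bivariate normal with unit variances and correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho ** 2)       # conditional standard deviation
    x1, x2 = 0.0, 0.0                    # starting point
    samples = []
    for i in range(n):
        # each draw conditions on the most recent value of the other coordinate
        x1 = rng.gauss(mu[0] + rho * (x2 - mu[1]), sd)
        x2 = rng.gauss(mu[1] + rho * (x1 - mu[0]), sd)
        if i >= burn_in:                 # discard the early draws
            samples.append((x1, x2))
    return samples

def sample_corr(samples):
    """Sample correlation between the two coordinates."""
    n = len(samples)
    m1 = sum(s[0] for s in samples) / n
    m2 = sum(s[1] for s in samples) / n
    c12 = sum((s[0] - m1) * (s[1] - m2) for s in samples)
    v1 = sum((s[0] - m1) ** 2 for s in samples)
    v2 = sum((s[1] - m2) ** 2 for s in samples)
    return c12 / math.sqrt(v1 * v2)

samples = gibbs_bivariate_normal(mu=(1.0, 2.0), rho=0.8)
print("sample correlation:", sample_corr(samples))
```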
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
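<br />
The procedure above can be sketched in Python as follows. The target used here (an unnormalized bivariate normal density with correlation 0.5), the symmetric <math>N(\text{current}, 1)</math> proposals and all names are assumptions made for illustration; with symmetric proposals the <math>\displaystyle q</math> ratios in Steps 3 and 6 equal 1:<br />

```python
import math
import random

def mh_within_gibbs(log_f, n=40000, step=1.0, seed=0):
    """Alternate one Metropolis-Hastings update of X (Y fixed) and one of Y (X fixed)."""
    rng = random.Random(seed)

    def accept(log_ratio):
        # r = min(1, ratio); the q terms cancel for a symmetric proposal
        r = 1.0 if log_ratio >= 0 else math.exp(log_ratio)
        return rng.random() < r

    x, y = 0.0, 0.0
    chain = []
    for _ in range(n):
        z = rng.gauss(x, step)                       # proposal for X, Y held fixed
        if accept(log_f(z, y) - log_f(x, y)):
            x = z
        z = rng.gauss(y, step)                       # proposal for Y, using the new X
        if accept(log_f(x, z) - log_f(x, y)):
            y = z
        chain.append((x, y))
    return chain

rho = 0.5  # hypothetical correlation of the example target

def log_f(x, y):
    """Log of an unnormalized bivariate normal density with correlation rho."""
    return -(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2))

chain = mh_within_gibbs(log_f)
mean_x = sum(p[0] for p in chain) / len(chain)
mean_y = sum(p[1] for p in chain) / len(chain)
print(mean_x, mean_y)
```

Working with the logarithm of <math>\displaystyle f</math> is a design choice that avoids numerical underflow; only ratios of <math>\displaystyle f</math> are needed, so the normalizing constant never has to be known.<br />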
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we can fix its overall scale. Summing the equation for <math>\displaystyle P_i</math> over all pages (assuming every page has at least one outgoing link and <math>\displaystyle d < 1</math>) shows that the ranks satisfy <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). Using this we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known, and that each column of A sums to 1)} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
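<br />
As an illustration, the matrix-form equation <math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P</math> can be iterated directly on a small example. The following Python sketch uses a hypothetical 4-page web and an assumed value <math>\displaystyle d = 0.85</math>; the function name <code>page_rank</code> and the stopping tolerance are also assumptions:<br />

```python
def page_rank(L, d=0.85, tol=1e-12, max_iter=1000):
    """Iterate P <- (1 - d) e + d L D^{-1} P until the update is below tol.

    L[i][j] = 1 if page j links to page i. Assumes every page has at
    least one outgoing link, so that no c_j is zero.
    """
    n = len(L)
    c = [sum(L[i][j] for i in range(n)) for j in range(n)]   # outgoing links per page
    P = [1.0] * n                                            # initial guess
    for _ in range(max_iter):
        new_P = [(1 - d) + d * sum(L[i][j] * P[j] / c[j] for j in range(n))
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_P, P)) < tol:
            return new_P
        P = new_P
    return P

# Hypothetical 4-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
L = [[0, 0, 1, 0],
     [1, 0, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 0, 0]]
P = page_rank(L)
print(P)
```

In this example page 2 receives links from three pages, while page 3 receives none and is left with only the constant term <math>\displaystyle 1-d</math>.<br />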
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
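<br />
The definitions above translate directly into code. This is a minimal Python sketch using plain lists; the function names and the two example vectors are illustrative choices:<br />

```python
import math

def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Length of v, the square root of v . v."""
    return math.sqrt(dot(v, v))

def dist(u, v):
    """Distance between u and v, the length of u - v."""
    return norm([a - b for a, b in zip(u, v)])

def angle(u, v):
    """Angle in radians, from u . v = ||u|| ||v|| cos(theta)."""
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))   # clamp against rounding error

u, v = [3.0, 4.0], [4.0, -3.0]
print(dot(u, v), norm(u), dist(u, v), angle(u, v))
```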
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that an alternate definition for the trace is:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix<br />
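<br />
As a quick check that the two definitions agree, consider the following small matrix (chosen here purely for illustration):<br />
<br />
<math>A = \left[\begin{matrix} 2 & 1 \\ 1 & 2 \end{matrix}\right], \quad tr(A) = 2 + 2 = 4</math><br />
<br />
Its eigenvalues solve <math>\displaystyle \det(A - \lambda I) = (2-\lambda)^2 - 1 = 0</math>, giving <math>\displaystyle \lambda_1 = 1</math> and <math>\displaystyle \lambda_2 = 3</math>, so the sum of the eigenvalues is also <math>\displaystyle 1 + 3 = 4</math>.<br />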
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A square matrix that is not invertible is called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), <math>A^{\dagger} = (A^TA)^{-1}A^T</math> and <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given a matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular, all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real, and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite, all eigenvalues are positive.<br />
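<br />
For a 2 by 2 symmetric matrix the characteristic equation is a quadratic, so the eigenvalues can be written down directly. The Python sketch below (an illustration, not part of the original notes) solves <math>\lambda^2 - tr(A)\lambda + |A| = 0</math> for the matrix [[a, b], [b, d]] and builds unit eigenvectors; the function name <code>eig_sym_2x2</code> is an assumption made for this example.<br />

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, d]]."""
    tr, det = a + d, a * d - b * b
    # discriminant equals (a - d)^2 + 4 b^2, so it is never negative
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    lams = ((tr + disc) / 2.0, (tr - disc) / 2.0)   # largest eigenvalue first
    if b == 0:
        # already diagonal: the coordinate axes are eigenvectors
        vecs = [(1.0, 0.0), (0.0, 1.0)] if a >= d else [(0.0, 1.0), (1.0, 0.0)]
        return lams, vecs
    vecs = []
    for lam in lams:
        # (A - lam I) v = 0 is solved by v proportional to (b, lam - a)
        v = (b, lam - a)
        n = math.hypot(v[0], v[1])
        vecs.append((v[0] / n, v[1] / n))
    return lams, vecs


lams, (v1, v2) = eig_sym_2x2(2.0, 1.0, 2.0)
print(lams)                              # eigenvalues of [[2, 1], [1, 2]]
print(v1[0] * v2[0] + v1[1] * v2[1])     # dot product of the two eigenvectors
```

For this matrix the eigenvalues sum to the trace and multiply to the determinant, both are positive (the matrix is positive-definite), and the two eigenvectors are orthogonal, in line with the properties listed above.<br />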
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix.<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*This implies that <math>A^T=A^{-1}</math>.<br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
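<br />
As a small numerical illustration of the magnitude-preserving property (a Python sketch, not from the original notes), a 2 by 2 rotation matrix is used as the orthonormal transformation; the angle 0.7 and the vector (3, 4) are arbitrary choices.<br />

```python
import math


def rotation(theta):
    """2x2 rotation matrix; its rows and columns are orthonormal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(A, x):
    """Matrix-vector product y = A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm(x):
    """Euclidean length |x| = sqrt(x^T x)."""
    return math.sqrt(sum(xi * xi for xi in x))


A = rotation(0.7)
x = [3.0, 4.0]
y = matvec(A, x)
print(norm(x), norm(y))   # the two lengths agree up to rounding
```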
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances along the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' principal components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, so that<br />
<br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> formed by the weight vector <math>\displaystyle \textbf{w} </math>.<br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint. The problem then becomes,<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g(x,y) = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to find out which one is the maximum, we just need to substitute each point into <math>\displaystyle f(x,y)</math> and see which one has the largest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
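<br />
The result of this example can be sanity-checked without Lagrange multipliers. The Python sketch below (an illustration, not part of the original notes) uses the fact that every point on the constraint circle can be written as (cos t, sin t) and searches a fine grid of angles for the largest value of f; the grid size <code>n</code> is an arbitrary choice.<br />

```python
import math


def f(x, y):
    """Objective of the example."""
    return x - y


# Points on the constraint x^2 + y^2 = 1 are (cos t, sin t), so a fine grid
# over t approximates the constrained maximum.
n = 100_000
best_t = max((2 * math.pi * k / n for k in range(n)),
             key=lambda t: f(math.cos(t), math.sin(t)))
best = (math.cos(best_t), math.sin(best_t))

print(best)        # close to (sqrt(2)/2, -sqrt(2)/2)
print(f(*best))    # close to sqrt(2)
```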
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (i.e. <math> \textbf{w}^T \textbf{w} = 1</math>), then the second term of the Lagrangian is 0 and <math>\displaystyle L(\textbf{w},\lambda)</math> equals the objective <math> \textbf{w}^T S \textbf{w} </math>.<br />
<br />
Setting the partial derivative of <math>\displaystyle L(\textbf{w},\lambda)</math> with respect to <math>\displaystyle \lambda</math> to zero recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of the Lagrangian satisfies the constraint.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
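<br />
The result above says the first principal component is the eigenvector of the sample covariance matrix with the largest eigenvalue. The Python sketch below (an illustration, not part of the original notes) finds that eigenvector by power iteration, a technique swapped in here because it needs no linear algebra library. The toy data set, the function names, and the all-ones starting vector are assumptions; power iteration fails if the starting vector is orthogonal to the leading eigenvector or if the covariance matrix is zero.<br />

```python
import math


def sample_covariance(data):
    """Sample covariance matrix S (D x D) of a list of D-dimensional points."""
    n, d = len(data), len(data[0])
    mean = [sum(p[j] for p in data) / n for j in range(d)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in data) / (n - 1)
             for j in range(d)] for i in range(d)]


def first_pc(S, iters=500):
    """Unit eigenvector of S with the largest eigenvalue, by power iteration."""
    d = len(S)
    w = [1.0] * d
    for _ in range(iters):
        w = [sum(S[i][j] * w[j] for j in range(d)) for i in range(d)]
        length = math.sqrt(sum(x * x for x in w))
        w = [x / length for x in w]
    # Rayleigh quotient w^T S w: the variance captured along w
    lam = sum(w[i] * S[i][j] * w[j] for i in range(d) for j in range(d))
    return w, lam


data = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2),
        (-1.5, -1.4), (3.0, 3.1), (-3.0, -3.1)]
S = sample_covariance(data)
w, lam = first_pc(S)
print(w, lam)
```

The returned <code>lam</code> equals the sample variance of the projections <math>u_i = \textbf{w}^T \textbf{x}_i</math>, which is the identity <math>\textbf{w}^{*T} S\textbf{w}^* = \lambda^*</math> derived above.<br />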
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat') % loads the data matrix X<br />
 who<br />
 size(X)<br />
 imagesc(reshape(X(:,1),20,28)') % show the first noisy image<br />
 colormap gray<br />
 [u, s, v] = svd(X);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (images etc.). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare the different types of data) or cluster the data (i.e. leave the data unlabeled and look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
The matlab code is as follows.<br />
 load 2_3 % the data matrix X in this file is 64 X 400<br />
 [coefs , scores ] = princomp (X') % recent Statistics Toolbox releases replace princomp with pca<br />
 % performs principal components analysis on the data matrix X<br />
 % returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
 plot(scores(:,1),scores(:,2)) <br />
 % plots the first most variant dimension on the x-axis <br />
 % and the second highest on the y-axis <br />
 plot(scores(1:200,1),scores(1:200,2))<br />
 % same graph as above, only with the 2s (not 3s)<br />
 hold on % this command allows us to add to the current plot<br />
 plot (scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
 % the 'ro' option makes them red Os on the plot<br />
 % If we classify based on the position in this plot (feature), <br />
 % it's easier than looking at each of the 64 data pieces<br />
 gname() % displays a figure window and <br />
         % waits for you to press a mouse button or a keyboard key<br />
 figure<br />
 subplot(1,2,1)<br />
 imagesc(reshape(X(:,45),8,8)') % show image number 45<br />
 subplot(1,2,2)<br />
 imagesc(reshape(X(:,93),8,8)') % show image number 93<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, Y = U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, \hat{X} = UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
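<br />
The centring, encoding and reconstruction steps in the notes above can be sketched in Python for the special case d = 1 (an illustration, not the algorithm from the lecture slide). The function name <code>pca_1d</code>, the use of power iteration to obtain the leading eigenvector, and the toy data are assumptions; the iteration assumes the data are not all identical.<br />

```python
import math


def pca_1d(data, iters=200):
    """Encode D-dimensional points with one principal component and reconstruct."""
    n, D = len(data), len(data[0])
    mean = [sum(p[j] for p in data) / n for j in range(D)]
    # step 1: subtract the mean so the data have mean 0
    X = [[p[j] - mean[j] for j in range(D)] for p in data]
    # sample covariance matrix of the centred data
    S = [[sum(x[i] * x[j] for x in X) / (n - 1) for j in range(D)]
         for i in range(D)]
    # leading eigenvector u of S by power iteration
    u = [1.0] * D
    for _ in range(iters):
        u = [sum(S[i][j] * u[j] for j in range(D)) for i in range(D)]
        length = math.sqrt(sum(c * c for c in u))
        u = [c / length for c in u]
    # step 2: encode, y = u^T x (one number per point)
    Y = [sum(ui * xi for ui, xi in zip(u, x)) for x in X]
    # step 3: reconstruct, x_hat = u y, then add the mean back
    X_hat = [[mean[j] + y * u[j] for j in range(D)] for y in Y]
    return u, Y, X_hat


data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]   # points on a single line
u, Y, X_hat = pca_1d(data)
print(Y)
print(X_hat)
```

Because these toy points lie exactly on a line, one component reconstructs them without loss; for general data the reconstruction drops the variance in the discarded directions.<br />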
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower dimensional space. The difference is that we are not interested in maximizing variances. Rather our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> : within classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data after projection====<br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br />
Let <math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> : between classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisifies both of the goals outlined above (at the same time).<br /><br />
(1) <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
(2) <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br />
<math>\displaystyle [(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) ] / [(w^T(\Sigma_1 + \Sigma_2)w)] </math><br />
<br />
or<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)/(w^Ts_ww)</math><br />
<br />
<br />
<br />
This is a very famous problem. We can solve this using Lagrange Multipliers. Since w is a directional vector, we do not care about the size of w. Therefore we can fix the denominator of the ratio and solve the following constrained optimization problem<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
<br />Subject To: <br />
<math>\displaystyle (w^T(\Sigma_1 + \Sigma_2)w)=1 </math> or <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
<br />
<br />
Therefore, the function that we want to maximize is<br />
<br />
<math>\displaystyle (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) - \lambda [(w^T(\Sigma_1 + \Sigma_2)w)-1] </math><br />
<br />
or <br />
<br />
<math>\displaystyle (w^Ts_Bw) - \lambda [(w^Ts_ww)-1] </math><br />
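<br />
Setting the derivative of this Lagrangian with respect to w to zero gives <math>\displaystyle s_Bw = \lambda s_ww</math>. Since <math>\displaystyle s_Bw</math> always points along <math>\displaystyle \mu_1 - \mu_2</math>, the maximizing direction is proportional to <math>\displaystyle s_w^{-1}(\mu_1 - \mu_2)</math>, provided <math>\displaystyle s_w</math> is invertible. This standard two-class result is not derived in the notes above, so the Python sketch below should be read as an illustration under that assumption; the function names and the toy 2-dimensional data are hypothetical.<br />

```python
def mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(2)]


def cov(points):
    """2x2 sample covariance matrix of a list of 2-D points."""
    m, n = mean(points), len(points)
    return [[sum((p[i] - m[i]) * (p[j] - m[j]) for p in points) / (n - 1)
             for j in range(2)] for i in range(2)]


def fisher_direction(class1, class2):
    """Direction proportional to s_w^{-1} (mu_1 - mu_2), assuming s_w is invertible."""
    m1, m2 = mean(class1), mean(class2)
    c1, c2 = cov(class1), cov(class2)
    sw = [[c1[i][j] + c2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def fisher_ratio(w, class1, class2):
    """Objective (w^T s_B w) / (w^T s_w w) for a candidate direction w."""
    m1, m2 = mean(class1), mean(class2)
    c1, c2 = cov(class1), cov(class2)
    between = (w[0] * (m1[0] - m2[0]) + w[1] * (m1[1] - m2[1])) ** 2
    within = sum(w[i] * (c1[i][j] + c2[i][j]) * w[j]
                 for i in range(2) for j in range(2))
    return between / within


class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (4.0, 5.0)]
class2 = [(4.0, 1.0), (5.0, 2.0), (6.0, 2.0), (7.0, 4.0)]
w = fisher_direction(class1, class2)
print(w, fisher_ratio(w, class1, class2))
```

Comparing <code>fisher_ratio</code> at <code>w</code> with its value at other directions is a quick way to see that the objective is scale-invariant and that no other direction separates the two classes better by this criterion.<br />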
<br />
<br />
<br />
<br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Example: FDA<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance we could be wishing to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image.<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> are called the ''training set.''<br />
<br />
Then <math>h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(x)\neq y) </math><br />
<br />
Given test points, how can we estimate the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
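<br />
The empirical error rate is a direct count, as the following Python sketch shows (an illustration, not part of the original notes). The threshold classifier <code>h</code> and the five test points are hypothetical.<br />

```python
def error_rate(h, xs, ys):
    """Fraction of test points (x_i, y_i) for which h(x_i) != y_i."""
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


def h(x):
    """Hypothetical classifier: predict class 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0


xs = [0.1, 0.4, 0.6, 0.9, 0.45]
ys = [0, 1, 1, 1, 0]
print(error_rate(h, xs, ys))   # one of the five points is misclassified
```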
<br />
=== Bayes Classification Rule ===<br />
<br />
Considering the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math>:<br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)Pr(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)Pr(Y=1)+Pr(X=x \mid Y=0)Pr(Y=0) </math><br />
<br />
So our classifier function<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
<br />
This function is considered the best classifier in terms of error rate. <br />
<br />
A problem is that we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math>.<br />
<br />
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques.<br />
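<br />
When the distributions are known, the Bayes rule above is easy to compute. The Python sketch below (an illustration, not part of the original notes) uses a made-up feature with three possible values; the class-conditional probabilities <code>lik1</code> and <code>lik0</code> and the prior <code>prior1</code> are hypothetical numbers chosen for the example.<br />

```python
def posterior(x, lik1, lik0, prior1):
    """r(x) = Pr(Y=1 | X=x) from Bayes' theorem for a two-class problem."""
    num = lik1[x] * prior1
    den = num + lik0[x] * (1.0 - prior1)
    return num / den


def bayes_classifier(x, lik1, lik0, prior1):
    """h(x) = 1 if r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior(x, lik1, lik0, prior1) >= 0.5 else 0


# Hypothetical distributions: Pr(X = x | Y = y) for a feature with three values.
lik1 = {"a": 0.7, "b": 0.2, "c": 0.1}   # Pr(X = x | Y = 1)
lik0 = {"a": 0.1, "b": 0.3, "c": 0.6}   # Pr(X = x | Y = 0)
prior1 = 0.4                            # Pr(Y = 1)

for x in ("a", "b", "c"):
    print(x, posterior(x, lik1, lik0, prior1),
          bayes_classifier(x, lik1, lik0, prior1))
```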
<br />
One such technique is Decision Boundary<br />
<br />
=== Decision Boundary ===</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3345stat341 / CM 3612009-07-21T22:55:24Z<p>Hclam: /* Classification - July 21 (cont) */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges, because a computer follows deterministic instructions. In computational statistics we therefore generate ''pseudo-random'' numbers: a deterministic sequence constructed so that the distribution of the generated numbers appears to be uniform. Outside a computational setting, producing a uniform distribution is fairly easy (for example, rolling a fair die repeatedly to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for every ''n'', and the period to equal ''m'' - 1. In practice, Park and Miller found in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the largest value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
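<br />
The two worked examples and the Park and Miller parameters can be checked with a short sketch. It is written in Python (standard library only) rather than Matlab, and the function names <code>lcg</code> and <code>cycle_length</code> are illustrative choices, not part of the lecture.<br />
<br />
```python
def lcg(a, b, m, seed, count):
    """Return x_1, ..., x_count of the recurrence x_{i+1} = (a*x_i + b) mod m."""
    values = []
    x = seed
    for _ in range(count):
        x = (a * x + b) % m
        values.append(x)
    return values


def cycle_length(a, b, m, seed):
    """Length of the cycle that the sequence started at `seed` eventually enters."""
    first_seen = {}  # state -> step at which it first appeared
    x, step = seed, 0
    while x not in first_seen:
        first_seen[x] = step
        x = (a * x + b) % m
        step += 1
    return step - first_seen[x]


# First example: a = 13, b = 0, m = 31, x_0 = 1
print(lcg(13, 0, 31, 1, 3))          # [13, 14, 27]
print(cycle_length(13, 0, 31, 1))    # 30, i.e. the full period m - 1

# Second example: a = 3, b = 2, m = 4, x_0 = 1 gets stuck at 1
print(lcg(3, 2, 4, 1, 3))            # [1, 1, 1]

# Park and Miller parameters, scaled to (0, 1] by dividing by m - 1
m = 2**31 - 1
uniforms = [x / (m - 1) for x in lcg(7**5, 0, m, 1, 5)]
print(uniforms)
```
<br />
A cycle length of 1 for the second parameter set shows why it is unsuitable, while the first set visits every value from 1 to 30 before repeating.<br />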
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that applying the inverse of the cumulative distribution function (cdf) of some distribution to a random sample from the uniform distribution yields a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, strictly increasing, and is the cumulative distribution function (cdf) of some distribution, then the random variable X has cdf F, i.e. <math>P(X \leq x) = F(x)</math>.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
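So F is monotonically non-decreasing, since the probability that X is at most a greater number cannot be smaller than the probability that X is at most a lesser number. Under the assumptions of the theorem F is strictly increasing, so <math>F^{-1}</math> exists and applying F to both sides of an inequality preserves it.<br />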
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution with rate <math>\theta</math>).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
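:<math> F^{-1}(y) = \frac{-\log(1-y)}{\theta} </math><br />
<br />
Since <math>1-U \sim~ Unif[0, 1]</math> whenever <math>U \sim~ Unif[0, 1]</math>, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_i</math>, <math>i=1,\dots,n</math>, from <math>f(x)</math> by:<br />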
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows an exponential pattern, as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is difficult or impossible to invert <math>F(x)</math> in closed form. Further, for some distributions <math>F(x)</math> itself has no closed form expression, and even when it does, <math>F^{-1}</math> is usually difficult to find.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case. (The inequalities are written so that the case <math>U = 1</math> is covered; since any single value of <math>U</math> has probability zero, the choice between strict and non-strict inequalities does not affect the distribution of <math>X</math>.)<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
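p=[0.3,0.2,0.5];           % P(X=0), P(X=1), P(X=2)<br />
x=zeros(1,1000);           % preallocate the sample<br />
for i=1:1000<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
elseif u <= sum(p(1:2))    % cumulative probability 0.3 + 0.2 = 0.5<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />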
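<br />
The Matlab example hard-codes the three cases. A sketch of the general procedure, for any list of values and probabilities, is given below in Python (standard library only); the function name <code>sample_discrete</code> and its optional argument <code>u</code> are illustrative assumptions rather than anything defined in class.<br />
<br />
```python
import random
from itertools import accumulate


def sample_discrete(values, probs, u=None):
    """Return x_k for the first k whose cumulative probability p_0 + ... + p_k is >= u."""
    if u is None:
        u = random.random()  # U ~ Unif[0, 1)
    for value, cumulative in zip(values, accumulate(probs)):
        if u <= cumulative:
            return value
    return values[-1]  # only reached if rounding leaves the total just below u


random.seed(0)  # fixed seed so the run is repeatable
draws = [sample_discrete([0, 1, 2], [0.3, 0.2, 0.5]) for _ in range(20000)]
for value in (0, 1, 2):
    print(value, draws.count(value) / len(draws))  # close to 0.3, 0.2, 0.5
```
<br />
Passing <code>u</code> explicitly makes it easy to check individual cases, for example that <math>u = 0.4</math> falls in the interval assigned to <math>X = 1</math>.<br />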
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c \ |\ c \cdot g(x) \geq f(x)\ \forall x</math>, we can draw candidates <math>Y</math> from <math>g</math> and accept each one with probability <math> \frac {f(Y)}{c \cdot g(Y)} </math>; the accepted candidates form a sample that follows the target distribution <math>f(x)</math>. Candidates are accepted often where the ratio is close to 1 and are mostly rejected where it is close to 0.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is close to 1, so a candidate drawn from <math>g(x)</math> is almost always accepted.<br />
<br />
At x=9, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is smaller, so a candidate drawn from <math>g(x)</math> is accepted only with that probability and rejected otherwise.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> f(x) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math>g(x)</math> to be the <math> Unif [0,1] </math> density, so that <math>g(x) = 1</math> on this interval<br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
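<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. In general, this maximum is the smallest constant satisfying <math> c \cdot g(x) \geq f(x) </math>, and since the probability of acceptance is <math>\frac{1}{c}</math>, it is also the most efficient choice of <math>c</math> for the Acceptance/Rejection Method.<br />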
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
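i = 0;              % number of accepted samples so far<br />
j = 1;              % index of the next free slot in x<br />
while i < 1000<br />
y = rand;           % candidate Y ~ Unif[0,1]<br />
u = rand;           % U ~ Unif[0,1]<br />
if u <= y           % accept with probability f(y)/(c*g(y)) = 2y/2 = y<br />
x(j) = y;<br />
j = j + 1;<br />
i = i + 1;<br />
end<br />
end<br />
hist(x);<br />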
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
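<br />
The steps of this discrete example can be sketched as follows, in Python (standard library only) rather than Matlab. The names <code>pmf</code> and <code>sample_target</code> are illustrative; the sketch also counts proposals so that the acceptance rate can be compared with <math>1/c</math>.<br />
<br />
```python
import random

pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}   # target f(x)
g = 1 / len(pmf)                              # proposal: uniform on {1, 2, 3, 4}
c = max(pmf.values()) / g                     # 0.55 / 0.25 = 2.2


def sample_target():
    """Return one accepted value together with the number of proposals it took."""
    proposals = 0
    while True:
        proposals += 1
        y = random.randint(1, 4)              # Y ~ Unif{1, 2, 3, 4}
        u = random.random()                   # U ~ Unif[0, 1)
        if u <= pmf[y] / (c * g):
            return y, proposals


random.seed(0)  # fixed seed so the run is repeatable
results = [sample_target() for _ in range(20000)]
samples = [y for y, _ in results]
for value in pmf:
    print(value, samples.count(value) / len(samples))   # close to pmf[value]
print(len(results) / sum(k for _, k in results))        # close to 1/c
```
<br />
On average about <math>c = 2.2</math> proposals are needed per accepted value, which is the cost of using a uniform proposal for this target.<br />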
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before obtaining a sample of the desired size from the target distribution. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math> 1/\lambda </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
It appears that if we want to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add the results. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far; for example, the inverse of F(Z) has no closed form. We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is its distance from the origin (the pole), and θ is the angle that the line from the origin to the point makes with the positive x-axis (the polar axis), then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi} \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independent} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (the variable u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right]</math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta = \mu</math> is Gaussian (with <math>\sigma</math> known):<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|X^N) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|X^N)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
:<math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean generally differs from the maximum likelihood estimate <math>\bar{x}</math>: it is a weighted average of <math>\bar{x}</math> and the prior mean <math>\mu_0</math>.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
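<br />
The behaviour of <math>\tilde{\mu}</math> at the two extremes can be checked numerically. The sketch below is in Python (standard library only); the function name <code>posterior_mean</code> and the sample values are illustrative assumptions, not data from class.<br />
<br />
```python
def posterior_mean(xs, sigma, mu0, tau):
    """Posterior mean of mu for data xs ~ N(mu, sigma^2) and prior mu ~ N(mu0, tau^2)."""
    n = len(xs)
    data_weight = n / sigma**2       # N / sigma^2
    prior_weight = 1 / tau**2        # 1 / tau^2
    x_bar = sum(xs) / n if n else 0.0
    return (data_weight * x_bar + prior_weight * mu0) / (data_weight + prior_weight)


print(posterior_mean([], 1.0, 5.0, 2.0))              # 5.0: no data, so the prior mean
print(posterior_mean([2.0] * 4, 1.0, 0.0, 1.0))       # 1.6, between mu0 = 0 and x_bar = 2
print(posterior_mean([2.0] * 10000, 1.0, 0.0, 1.0))   # approaches x_bar = 2 as N grows
```
<br />
With four observations the prior still pulls the estimate noticeably toward <math>\mu_0</math>; with ten thousand its influence is negligible.<br />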
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to the simulation of a random process, one whose future evolution is described by probability distributions (the counterpart to a deterministic process); such simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim Unif(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
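<br />
The process and the two statistics above can be sketched together as follows, in Python (standard library only) rather than Matlab. The function name <code>mc_integrate</code> and the default <code>z = 1.96</code> (an approximate 95% interval) are illustrative choices; the integrand <math>h(x) = x^3</math> on <math>[0,1]</math> is used because its exact integral, 1/4, is known.<br />
<br />
```python
import math
import random


def mc_integrate(h, a, b, n, z=1.96):
    """Estimate the integral of h over [a, b] from n uniform draws.

    Returns (estimate, standard error, approximate interval estimate +/- z*SE).
    """
    w = [h(random.uniform(a, b)) * (b - a) for _ in range(n)]   # w(x_i) = h(x_i)(b - a)
    i_hat = sum(w) / n
    s2 = sum((y - i_hat) ** 2 for y in w) / (n - 1)             # sample variance of w(x_i)
    se = math.sqrt(s2 / n)
    return i_hat, se, (i_hat - z * se, i_hat + z * se)


random.seed(0)  # fixed seed so the run is repeatable
estimate, se, interval = mc_integrate(lambda x: x**3, 0.0, 1.0, 100000)
print(estimate, se, interval)   # the exact value of the integral is 1/4
```
<br />
Because the standard error shrinks like <math>1/\sqrt{N}</math>, quadrupling the number of draws only halves the width of the interval.<br />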
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x > 0</math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u=rand(100,1);<br />
x=-log(u);     % x ~ Exp(1) by the Inverse Transform Method<br />
h=x.^.5;       % h(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value obtained by numerical integration in Matlab<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# The exact value, for comparison, is <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);   % 1000 draws from Unif(0,1)<br />
mean(x.^3)          % elementwise cube; should be close to 1/4<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite interval,<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute in <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math>, and <math>x=5, x\to\infty</math> correspond to <math>y=1, y\to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
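<br />
This example can be checked numerically, since the integral also has the closed form <math>3e^{-5}</math>. The sketch below is in Python (standard library only). After the substitution <math>x = \frac{1}{y}+4</math> we have <math>dx = -\frac{dy}{y^2}</math> and the limits reverse, so the integrand over <math>(0,1]</math> used here is <math>\frac{3e^{-(\frac{1}{y}+4)}}{y^2}</math>, which is positive; the function name <code>w</code> follows the notation of the example.<br />
<br />
```python
import math
import random


def w(y):
    """Integrand on (0, 1] after substituting x = 1/y + 4 in 3*exp(-x)."""
    return 3.0 * math.exp(-(1.0 / y + 4.0)) / y**2


random.seed(0)  # fixed seed so the run is repeatable
n = 100000
# 1 - random.random() lies in (0, 1], which avoids dividing by zero at y = 0
estimate = sum(w(1.0 - random.random()) for _ in range(n)) / n
exact = 3.0 * math.exp(-5.0)     # the integral can be done by hand: 3e^{-5}
print(estimate, exact)
```
<br />
The transformed integrand is bounded on <math>(0,1]</math>, so the ordinary Monte Carlo estimate behaves well even though the original range of integration is infinite.<br />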
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution, and the posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior density of the probability 'p' of a Binomial is proportional to the likelihood, <math>\displaystyle p^x(1-p)^{n-x}</math>, which is the kernel of a Beta distribution. Therefore<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\ \text{CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We proceed as follows:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>; otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta = x_j</math>, where <math>\displaystyle j</math> is the first index with <math>\displaystyle p_j \geq 0.5</math>.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die, <math>Y_i</math>, is observed, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
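<br />
The Monte Carlo process for this example could be sketched in Matlab roughly as follows. This is only an illustration under stated assumptions: the dose levels <code>x</code> and every death count other than <math>Y_1 = 4</math> and <math>Y_2 = 3</math> are invented, since the full data set is not given here, and draws in which no <math>p_i</math> reaches 0.5 are simply skipped. Because most draws violate the ordering constraint, the loop rejects far more samples than it keeps.<br />

```matlab
% Sketch of the rejection-based Monte Carlo estimate of delta.
% ASSUMPTION: x and all of y except y(1)=4, y(2)=3 are made-up illustrative data.
n = 10;                          % rats tested at each dose
x = 1:10;                        % hypothetical dose levels x_1 < ... < x_10
y = [4 3 5 6 6 7 8 8 9 10];      % deaths per dose (hypothetical after the first two)
N = 500;                         % number of accepted samples wanted
delta = zeros(1, N);
k = 0;
while k < N
    p = betarnd(y+1, n-y+1);     % one draw of (p_1,...,p_10) from the Beta distributions
    if ~issorted(p)              % keep only draws with p_1 <= p_2 <= ... <= p_10
        continue
    end
    j = find(p >= 0.5, 1);       % first dose with at least a 50% chance of death
    if isempty(j)                % ASSUMPTION: skip draws where no p_i reaches 0.5
        continue
    end
    k = k + 1;
    delta(k) = x(j);
end
delta_bar = mean(delta)          % Monte Carlo estimate of delta
```

A histogram of <code>delta</code> (for example with <code>hist(delta, x)</code>) shows how the estimated dose is spread over the ten levels.<br />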
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the given random variables. The main idea in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the proposal distribution to sample from and g(x) is the function to be integrated. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>, where <math>u_1, ..., u_n</math> are draws from U. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Draw <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math>.<br />
# Compute <math>\displaystyle w(x_i)=\frac{h(x_i)f(x_i)}{g(x_i)}</math> for each sample.<br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> that are more likely under the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left(\frac{f(x)}{g(x)}h(x)\right)</math>, which is the same thing as saying that we are applying a regular Monte Carlo simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, but weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that points that are more likely under f than under g count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of the <math> h(x_i) </math> instead of a simple average.<br />
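<br />
The importance sampling process can be sketched in Matlab as follows. The choices here are assumptions made only so that the true answer is known: f is the standard normal density, <math>h(x) = x^2</math> (so <math>I = 1</math>), and the proposal g is a normal density with standard deviation 2. In this toy case f is easy to sample directly, so the sketch only checks the mechanics.<br />

```matlab
% Sketch: importance sampling estimate of I = E_f[h(X)] using a proposal g.
% ASSUMPTION: f = N(0,1), h(x) = x^2 and g = N(0, 2^2) are illustrative choices.
N = 100000;                                              % number of samples
sigma = 2;                                               % standard deviation of the proposal
x = sigma * randn(N, 1);                                 % x_i ~ g
f = exp(-x.^2 / 2) / sqrt(2*pi);                         % target density f at the samples
g = exp(-x.^2 / (2*sigma^2)) / (sigma * sqrt(2*pi));     % proposal density g at the samples
w = (x.^2) .* f ./ g;                                    % w(x_i) = h(x_i) f(x_i) / g(x_i)
I_hat = mean(w)                                          % should be close to 1
```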
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f > g</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a good choice of g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> much faster than <math>\displaystyle h^2(x)f^2(x) </math> does, then <math>\displaystyle E[(w(x))^2] </math> can be infinite. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, since then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can become arbitrarily large in the tails. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
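<br />
As a concrete illustration (a worked example with densities chosen purely for convenience, not taken from the lecture), take <math>h(x) = 1</math>, let <math>f</math> be the <math>N(0,1)</math> density and let <math>g</math> be the <math>N(0,\sigma^2)</math> density. Then<br />
::<math>\displaystyle \frac{h^2(x)f^2(x)}{g(x)} = \frac{\sigma}{\sqrt{2\pi}} \exp\left(-x^2\left(1 - \frac{1}{2\sigma^2}\right)\right)</math><br />
which is integrable only if <math>1 - \frac{1}{2\sigma^2} > 0</math>, i.e. <math>\sigma^2 > \frac{1}{2}</math>. A proposal with <math>\sigma^2 \leq \frac{1}{2}</math> has a tail so much thinner than that of <math>f</math> that the second moment of <math>w(x)</math> is infinite.<br />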
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can find the <math>\displaystyle g</math> that minimizes the variance.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
::::::::::'''Aside: Jensen's Inequality'''<br />
::<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>; in this aside <math>\displaystyle g</math> denotes a generic function, not the proposal density), then for <math>\displaystyle 0 \leq \alpha \leq 1</math> we have <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
Essentially, the definition of convexity implies that the line segment between two points on a curve lies on or above the curve. This can be generalized to any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
::::::::::=======================================================<br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
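::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... + p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore, applying this with the convex function <math>\displaystyle u \mapsto u^2</math> to <math>\displaystyle \left| w(x) \right|</math>,<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />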
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E_g[(w(x))^2]</math>, we can see that this lower bound is attained, as required. Details omitted.<br /><br />
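<br />
For completeness, the substitution can be sketched as follows, writing <math>\displaystyle c = \int \left| h(s) \right| f(s) ds</math> so that <math>\displaystyle g^*(x) = \frac{\left| h(x) \right| f(x)}{c}</math> (the symbol <math>c</math> is introduced here only for brevity, and the integrals are taken over the region where <math>\displaystyle \left| h(x) \right| f(x) > 0</math>):<br />
::<math>\displaystyle E_{g^*}[(w(x))^2] = \int \frac{h^2(x)f^2(x)}{g^*(x)} dx = \int \frac{h^2(x)f^2(x)\, c}{\left| h(x) \right| f(x)} dx = c \int \left| h(x) \right| f(x) dx = c^2 = \left(\int \left| h(x) \right| f(x) dx \right)^2</math><br />
which equals the lower bound, so <math>\displaystyle g^*</math> attains the minimum.<br />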
<br />
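However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />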
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_1,...,x_N \sim~ g</math><br />
Since <math>\displaystyle \int f(x)\,dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful, especially when f is known only up to a proportionality constant, because any constant factor in f cancels between the numerator and the denominator. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
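<br />
A minimal Matlab sketch of this normalized-weight form is given below. All specifics are assumptions for illustration: the target is the standard normal density with its constant <math>1/\sqrt{2\pi}</math> deliberately dropped (<code>fTilde</code>), <math>h(x) = x^2</math> so that the true value is 1, and the proposal g is a normal density with standard deviation 2.<br />

```matlab
% Sketch: importance sampling with normalized weights and an unnormalized target.
% ASSUMPTION: fTilde, h and the proposal N(0, 2^2) are illustrative choices.
N = 100000;                                              % number of samples
sigma = 2;                                               % standard deviation of the proposal
x = sigma * randn(N, 1);                                 % x_i ~ g
fTilde = exp(-x.^2 / 2);                                 % N(0,1) density without its constant
g = exp(-x.^2 / (2*sigma^2)) / (sigma * sqrt(2*pi));     % proposal density at the samples
b = fTilde ./ g;                                         % unnormalized weights b(x_i)
bPrime = b / sum(b);                                     % normalized weights b'(x_i), summing to 1
h = x.^2;                                                % h(x) = x^2, so I = E_f[X^2] = 1
I_hat = sum(bPrime .* h)                                 % should be close to 1
```

The missing constant in <code>fTilde</code> never has to be computed, because it appears in both <code>b</code> and <code>sum(b)</code> and cancels.<br />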
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_1, ..., x_N \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to count the proportion of observations that are greater than or equal to 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is a low probability event (a tail event), and different runs may give significantly different values. For example, if we generate 1000 samples, the first run may give only 3 occurrences (so the estimated probability is .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we perform 100 runs, each generating 100 samples):<br />
<br />
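  format long<br />
  a = zeros(100,1); %estimate from each run<br />
  for i = 1:100<br />
    a(i) = sum(randn(100,1)>=3)/100; %fraction of 100 N(0,1) draws that are >= 3<br />
  end<br />
  meanMC = mean(a) %average of the 100 estimates<br />
  varMC = var(a) %run-to-run variance of the estimates<br />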
<br />
On running this, we get <math> \text{meanMC} = 0.0005 </math> and <math> \text{varMC} \approx 1.31313 \times 10^{-5} </math>.<br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x)</math> be the <math>N(4,1) </math> density.<br />
:<math>b(x) = \frac{f(x)}{g(x)} = \frac{e^{-x^2/2}}{e^{-(x-4)^2/2}} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
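  N = 100; %samples per run<br />
  I = zeros(100,1); %estimate from each run<br />
  for j = 1:100<br />
      x = randn(N,1) + 4; %draw from the proposal g = N(4,1)<br />
      h = (x >= 3); %indicator of the event x >= 3<br />
      b = exp(8-4*x); %weight f(x)/g(x)<br />
      I(j) = sum(h.*b)/N; %importance sampling estimate for this run<br />
  end<br />
  MEAN = mean(I)<br />
  VAR = var(I)<br />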
<br />
Running the above code gave us <math> \text{MEAN} \approx 0.001353 </math> and <math> \text{VAR} \approx 9.666 \times 10^{-8} </math>. The mean is close to the true value of about 0.0013, and the variance is much smaller than that observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math>.<br />
*'''State Space:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
That is,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way, for example by a coin toss. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&\dots&0&0\\<br />
q&0&p&\dots&0&0\\<br />
0&q&0&\dots&0&0\\<br />
\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\<br />
0&0&0&\dots&0&p\\<br />
0&0&0&\dots&0&1<br />
\end{matrix}\right)</math><br />
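<br />
A transition matrix for such a walk can be built in Matlab as sketched below, taking right steps to have probability p and left steps probability q. The number of states <code>N = 5</code> and the value <code>p = 0.3</code> are arbitrary choices made only for illustration.<br />

```matlab
% Sketch: transition matrix of a random walk on states 1..N with absorbing ends.
% ASSUMPTION: N = 5 and p = 0.3 are arbitrary illustrative values.
N = 5;
p = 0.3;                 % probability of a step to the right
q = 1 - p;               % probability of a step to the left
P = zeros(N);
P(1,1) = 1;              % the first state is absorbing
P(N,N) = 1;              % the last state is absorbing
for i = 2:N-1
    P(i,i-1) = q;        % move left from state i
    P(i,i+1) = p;        % move right from state i
end
disp(P)
rowSums = sum(P, 2)      % every row is a pmf, so each entry should equal 1
```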
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a variation of Monte Carlo estimation can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row sum to 1: row <math>i</math> must itself be a pmf, since a transition from <math>i</math> must lead to a state still within the state space of <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}Pr(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}Pr(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{Pr(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
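From this it follows that the n-step transition matrix is <math>\frac{}{}P(n)=P^n</math>: taking <math>\mu_0</math> to put all of its mass on state <math>i</math> shows that row <math>i</math> of <math>P^n</math> is the pmf of <math>X_n</math> given <math>X_0=i</math>.<br />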
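<br />
The relation <math>\mu_n=\mu_0P^n</math> is easy to check numerically. In the Matlab sketch below, the two-state matrix <code>P</code>, the starting vector <code>mu0</code> and the number of steps <code>n</code> are made-up values used only for illustration.<br />

```matlab
% Sketch: check mu_n = mu_0 * P^n numerically.
% ASSUMPTION: P, mu0 and n are made-up illustrative values.
P = [0.9 0.1; 0.5 0.5];          % hypothetical 2-state transition matrix
mu0 = [1 0];                     % start in state 1 with probability 1
n = 5;
mu = mu0;
for k = 1:n
    mu = mu * P;                 % one step: mu_k = mu_{k-1} * P
end
muDirect = mu0 * P^n;            % closed form from the theorem
disp([mu; muDirect])             % the two rows should agree
gap = max(abs(mu - muDirect))    % zero up to rounding error
```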
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an NxN matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' also NxN with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The state space <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that from states 1 and 2 the only nonzero transition probabilities are <math>\displaystyle P_{11}</math>, <math>\displaystyle P_{12}</math>, <math>\displaystyle P_{21}</math> and <math>\displaystyle P_{22}</math>, so these two states communicate only with each other<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
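<br />
The decomposition can also be found mechanically. The following Matlab sketch (one possible approach, not necessarily the one used in lecture) marks state j as reachable from state i when some power of the adjacency matrix has a positive (i,j) entry, groups states that reach each other, and then reports whether each class is closed.<br />

```matlab
% Sketch: communicating classes of the 4-state example chain.
P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];
N = size(P, 1);
A = double(P > 0);                   % one-step reachability (adjacency matrix)
R = (eye(N) + A)^(N-1) > 0;          % R(i,j): j is reachable from i in at most N-1 steps
C = R & R';                          % C(i,j): i and j communicate
classId = zeros(1, N);               % class label of each state
nClasses = 0;
for i = 1:N
    if classId(i) == 0
        nClasses = nClasses + 1;
        classId(C(i,:)) = nClasses;  % states communicating with i share its label
    end
end
classId                              % displays 1 1 2 3
for c = 1:nClasses
    members = (classId == c);
    isClosed = all(all(P(members, ~members) == 0));   % no probability leaves the class
    fprintf('class %d: states %s, closed = %d\n', c, mat2str(find(members)), isClosed);
end
```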
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called ''recurrent'' or ''persistent'' if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability of coming back to state i, starting from state i, is 1.<br />
<br />
If this is not the case (i.e. the probability is less than 1), then state i is ''transient''.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is ''recurrent'' if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is ''transient'' if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is ''recurrent'' and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is ''recurrent''<br />
*If <math>\displaystyle i</math> is ''transient'' and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is ''transient''<br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, states 1 and 2 form a closed class, so they are both recurrent. State 4 is an absorbing state, so it is also recurrent. State 3 is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the others will be transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. The only way to be back at <math>\displaystyle 0</math> after 2n steps is to have taken exactly n steps to the right and n steps to the left, in any order.<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not. Since a return to <math>\displaystyle 0</math> is impossible in an odd number of steps, <math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{\forall n}P_{00}(2n)</math>.<br />
<br />
To proceed with the analysis, we use ''Stirling's formula'':<br />
<br />
<math>\displaystyle n!\sim~n^{n}\sqrt{n}e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability could therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim~\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
So the sum <math>\displaystyle \sum_{\forall n}P_{00}(2n)</math> converges or diverges together with:<br />
<br />
<math>\displaystyle \sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> (that is, <math>\displaystyle p \neq \frac{1}{2}</math>) then the state is transient; otherwise it is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n) = \begin{cases}<br />
\infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
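<br />
This conclusion can be checked numerically with partial sums of <math>P_{00}(2n)</math>. The Matlab sketch below uses 2000 terms and compares <math>p = 0.5</math> with <math>p = 0.6</math>; both the cutoff and the second value of p are arbitrary illustrative choices. Each term is computed from the previous one to avoid evaluating very large binomial coefficients.<br />

```matlab
% Sketch: partial sums of P_00(2n) = nchoosek(2n,n) * (p*q)^n for the random walk.
% ASSUMPTION: nMax = 2000 terms and p = 0.6 are arbitrary illustrative choices.
nMax = 2000;
pValues = [0.5 0.6];
S = zeros(size(pValues));            % partial sum for each value of p
for k = 1:numel(pValues)
    p = pValues(k);
    q = 1 - p;
    term = 1;                        % n = 0 term: P_00(0) = 1
    total = term;
    for n = 1:nMax
        % ratio of consecutive binomial coefficients: C(2n,n)/C(2n-2,n-1) = 2(2n-1)/n
        term = term * 2*(2*n-1)/n * p*q;
        total = total + term;
    end
    S(k) = total;
end
S    % S(1) keeps growing as nMax increases; S(2) settles near 1/sqrt(1-4*0.6*0.4) = 5
```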
<br />
====Convergence of Markov chain====<br />
We define the ''recurrence time'' <math>\displaystyle T_{ij}</math> as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n \geq 1: x_{n}=j \mid x_{0}=i\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} n f_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that a recurrent state i is:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle n</math> if, starting from a state, we can return to it only at multiples of <math>\displaystyle n</math> steps, where <math>\displaystyle n > 1</math> is the largest integer with this property.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It is evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition: starting from state 1, we can return to it only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math>.<br />
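<br />
The period of state 1 in this example can be checked numerically. The sketch below takes the greatest common divisor of the step counts n at which a return to state 1 has positive probability; the horizon of 20 steps is an arbitrary assumption that happens to be enough for this small chain.<br />

```matlab
P = [0 .5 0 .5; .5 0 .5 0; 0 .5 0 .5; .5 0 .5 0];
d = 0;                        % running gcd of the return times found so far (gcd(0,n) = n)
Pn = eye(4);
for n = 1:20                  % finite horizon, chosen for illustration
    Pn = Pn * P;              % Pn now holds P^n
    if Pn(1,1) > 0            % a return to state 1 in exactly n steps is possible
        d = gcd(d, n);
    end
end
disp(d)                       % period of state 1
```

Only even n give a positive entry Pn(1,1), so the computed period is 2.<br />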
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf).<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math>, as required. The second equality uses detailed balance and the last one uses the fact that each row of P sums to 1.<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
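<br />
A short MATLAB check of this warning, written as a sketch with illustrative variable names: the uniform vector is stationary, yet the powers of P cycle with period 3 and therefore have no limit.<br />

```matlab
P = [0 1 0; 0 0 1; 1 0 0];
piVec = [1/3 1/3 1/3];
isStationary = max(abs(piVec*P - piVec)) < 1e-12;   % piVec*P equals piVec
P3 = P^3;      % identity matrix: after 3 steps the chain is back where it started
P4 = P^4;      % equal to P again, so the sequence P, P^2, P^3, ... keeps cycling
disp(isStationary)
disp(P3)
```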
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \ 0 \ 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\rightarrow\infty} \frac{1}{N}\displaystyle\sum_{n=1}^N g(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = \frac{1}{2}\pi_0 + \frac{1}{2}\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = \frac{1}{2}\pi_0 + \frac{1}{4}\pi_1 + \frac{1}{3}\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = \frac{1}{4}\pi_1 + \frac{2}{3}\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = \frac{3}{4}\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + \frac{3}{4}\pi_1 = \frac{11}{4}\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Substituting <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
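<br />
The same answer can be obtained numerically. The sketch below is one possible way to do it, not the method used in the lecture: it stacks the equations <math>\pi(P - I) = 0</math> with the normalization <math>\sum_i \pi_i = 1</math> and solves the resulting linear system, then compares against a high power of P. The variable names are illustrative.<br />

```matlab
P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];
k = size(P,1);
% pi*(P - I) = 0 together with sum(pi) = 1, written as A*pi' = b
A = [(P - eye(k))'; ones(1,k)];
b = [zeros(k,1); 1];
piVec = (A \ b)';             % least-squares solve; exact here because the system is consistent
Pn = P^50;                    % every row of a high power of P should be close to piVec
disp(piVec)
disp(Pn)
```

The solution agrees with (4/11, 4/11, 3/11), and each row of P^50 matches it, which illustrates that the stationary distribution is also the limiting distribution for this chain.<br />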
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
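<math>\bullet</math> Estimate <math>\displaystyle I</math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(X_i)</math>, which converges to <math>\displaystyle E_f(h(X))=I</math> as <math>N \rightarrow \infty</math><br />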
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>: this is the starting point of the chain. Choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>\displaystyle x</math> (at the first step this is the randomly chosen starting point <math>X_0</math>).<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of Metropolis-Hastings.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the density of y under a normal with mean x is equal to the density of x under a normal with mean y.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> usually lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is typically very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
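<br />
One way to see the effect of b without looking at histograms is to record the fraction of accepted proposals. The sketch below reruns the Cauchy example for the three values of b discussed above; the chain length N and the variable names are assumptions made for illustration.<br />

```matlab
bValues = [0.001 2 1000];     % too small, reasonable, too large
N = 5000;                     % chain length (assumption)
accRate = zeros(size(bValues));
for k = 1:numel(bValues)
    b = bValues(k);
    x = randn;                % random starting point
    nAccept = 0;
    for i = 2:N
        y = x + b*randn;                     % proposal N(x, b^2)
        r = min((1 + x^2)/(1 + y^2), 1);     % Cauchy ratio f(y)/f(x)
        if rand < r
            x = y;
            nAccept = nAccept + 1;
        end
    end
    accRate(k) = nAccept/(N - 1);            % fraction of accepted proposals
end
disp([bValues; accRate])
```

With b very small nearly every proposal is accepted but the chain barely moves, and with b very large nearly every proposal is rejected. Neither extreme explores the target well, so the acceptance rate alone is a diagnostic and not a guarantee of good mixing.<br />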
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> by detailed balance<br />
:::<math>=f(x)\int P(x,y)dy</math><br />
:::<math>=\displaystyle f(x)</math> since <math>\int P(x,y)dy=1</math><br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where it equals 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and multiplying both sides by <math>\displaystyle f(x)</math> we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
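<br />
The argument can be checked numerically on a small discrete state space. In the sketch below the target pmf f and the proposal matrix Q are hypothetical values chosen only for illustration, and Q is deliberately not symmetric. The code builds the Metropolis-Hastings transition matrix and measures how far it is from satisfying detailed balance and stationarity.<br />

```matlab
f = [0.1 0.2 0.3 0.4];                 % hypothetical target pmf on 4 states
Q = [0.10 0.40 0.30 0.20;              % hypothetical proposal matrix, Q(x,y) = q(y|x)
     0.30 0.20 0.10 0.40;
     0.25 0.25 0.25 0.25;
     0.40 0.10 0.40 0.10];
k = numel(f);
P = zeros(k);
for x = 1:k
    for y = 1:k
        if x ~= y
            r = min(f(y)*Q(y,x)/(f(x)*Q(x,y)), 1);   % acceptance probability r(x,y)
            P(x,y) = Q(x,y)*r;                       % propose y, then accept it
        end
    end
    P(x,x) = 1 - sum(P(x,:));          % rejected proposals keep the chain at x
end
F = diag(f)*P;                         % F(x,y) = f(x)P(x,y)
balanceGap = max(max(abs(F - F')));    % zero when detailed balance holds
stationGap = max(abs(f*P - f));        % zero when f = fP
fprintf('balance gap = %g, stationarity gap = %g\n', balanceGap, stationGap);
```

Both gaps are zero up to rounding error, as the proof predicts.<br />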
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999), whose algorithm is also the basis of the Simulated Annealing method (introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is more efficient, although it is applicable less often.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough: we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed).<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
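:<br\>3. Compute the acceptance probability <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math> and accept or reject <math>\displaystyle Y </math> as in the general algorithm<br />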
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In MATLAB, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known but the distribution is not exactly known.<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r(x_i,y) \\<br />
x_i, & \text{otherwise} \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
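<br />
A minimal sketch of an independence sampler illustrates this dependence. The target (a standard normal known up to a constant) and the proposal (a normal with standard deviation s = 2) are assumptions picked for illustration, and the acceptance test is done on the log scale for numerical stability.<br />

```matlab
N = 20000;
s = 2;                                   % proposal standard deviation (assumption)
logf = @(x) -x.^2/2;                     % log target, up to an additive constant
logq = @(x) -x.^2/(2*s^2);               % log proposal density, up to an additive constant
X = zeros(N,1);
X(1) = s*randn;
for i = 2:N
    Y = s*randn;                         % the proposal ignores X(i-1)
    % log of f(Y)/f(X) * q(X)/q(Y)
    logr = (logf(Y) - logf(X(i-1))) + (logq(X(i-1)) - logq(Y));
    if log(rand) < logr                  % same as rand < min(r,1)
        X(i) = Y;
    else
        X(i) = X(i-1);                   % rejection repeats the current state
    end
end
fracRepeated = mean(X(2:end) == X(1:end-1));
fprintf('mean = %.3f, std = %.3f, repeated = %.3f\n', mean(X), std(X), fracRepeated);
```

A noticeable fraction of consecutive values are identical because of rejections, so the output is a dependent sequence even though every proposal was drawn independently.<br />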
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But this is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant <math>\displaystyle T > 0</math> (since the exponential function is monotone increasing, <math>\displaystyle e^{\frac{-h(x)}{T}}</math> is largest where <math>\displaystyle h(x) </math> is smallest)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, set <math>\displaystyle i = i + 1</math> and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math> (but close to 1 when T is large, so moves to worse points are accepted often)<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \approx 0 </math> and we therefore almost always reject <math>\displaystyle y </math><br />
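<br />
To make the role of T concrete, the following sketch evaluates the acceptance probability of a single uphill move at several temperatures. The values h(x) = 1 and h(y) = 2 and the list of temperatures are hypothetical.<br />

```matlab
hx = 1;  hy = 2;                 % hypothetical values with h(y) > h(x): an uphill move
T = [100 10 1 0.1 0.01];         % temperatures from hot to cold (assumption)
r = min(exp((hx - hy)./T), 1);   % acceptance probability at each temperature
disp([T; r])
```

At T = 100 the uphill move is accepted about 99% of the time, while at T = 0.01 its acceptance probability is numerically indistinguishable from zero. This is why the chain explores freely while T is large and settles into a minimum as T is decreased.<br />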
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for large T and contracts for small T, i.e. T plays the role of variance, making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
In the end, when T has become small, the distribution that we are sampling from is much sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
% objective function (saved in its own file, obj.m)<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
% main script<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2; %standard deviation of the normal proposal<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); %propose Y ~ N(X(i-1), b^2)<br />
r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) ); %f(Y)/f(X(i-1)) as one exponential, which avoids 0/0 when T is small<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
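<br />
The manual procedure of rerunning the loop for each new T can be wrapped in an outer loop over a cooling schedule. The sketch below uses the temperatures listed above with 100 iterations per temperature; the anonymous function h and the other variable names are illustrative choices and not the code used to produce the graph.<br />

```matlab
h = @(x) (x - 2).^2;                       % objective from the example
Ts = [10 9 5 2 0.9 0.1 0.01 0.005 0.001];  % temperatures in the order used above
b = 2;                                     % proposal standard deviation
nPer = 100;                                % iterations per temperature
X = zeros(1, nPer*numel(Ts));              % full path of the chain
x = 0;                                     % starting point
idx = 0;
for T = Ts
    for i = 1:nPer
        Y = b*randn + x;                   % propose Y ~ N(x, b^2)
        r = min(exp((h(x) - h(Y))/T), 1);  % acceptance probability at temperature T
        if rand < r
            x = Y;
        end
        idx = idx + 1;
        X(idx) = x;
    end
end
fprintf('final state: %.4f\n', X(end));
```

Plotting X against the iteration index (for example with plot(X)) should show wide fluctuations at the start that shrink toward the minimizer x = 2 as T decreases.<br />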
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the one given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively large correlations between the random variables, because each conditional update can then move only a short distance.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle <br />
<br />
x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|Y_{i+1})</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution will be sampled better using only some of the last <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; % mean vector mu<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2); % conditional standard deviation<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); % conditional mean of x1 given the previous x2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); % conditional mean of x2 given the x1 just drawn<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -> <br />
%this demonstrates that convergence occurs after a while <br />
%(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
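<br />
Besides plotting, the sampler can be checked by comparing sample statistics after the burn-in period with the target parameters. The sketch below repeats the Gibbs loop with rho = 0.5, an illustrative value that makes the correlation easy to see; the burn-in length of 1000 and the variable names are also assumptions.<br />

```matlab
mu = [1 2];  rho = 0.5;                 % rho = 0.5 is an illustrative choice
sd = sqrt(1 - rho^2);                   % conditional standard deviation
N = 5000;  burn = 1000;                 % chain length and burn-in (assumptions)
X = zeros(N,2);                         % chain starts at (0,0)
for i = 2:N
    X(i,1) = mu(1) + rho*(X(i-1,2) - mu(2)) + sd*randn;   % x1 given the previous x2
    X(i,2) = mu(2) + rho*(X(i,1)   - mu(1)) + sd*randn;   % x2 given the x1 just drawn
end
S = X(burn+1:end,:);                    % discard the burn-in samples
sampleMean = mean(S);
C = corrcoef(S);
sampleRho = C(1,2);
fprintf('mean = (%.3f, %.3f), correlation = %.3f\n', sampleMean(1), sampleMean(2), sampleRho);
```

The sample mean should be close to (1, 2) and the sample correlation close to the chosen rho, up to Monte Carlo error.<br />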
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
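<br />
The procedure above can be sketched in MATLAB as follows. This is only an illustration under stated assumptions: the target <math>\displaystyle f</math> is taken to be a bivariate normal density with zero means, unit variances and correlation <code>rho</code> (known up to a constant), and both proposals are random walks with a hypothetical step size <code>step</code>. Because a random-walk proposal is symmetric, the ratio of proposal densities equals 1 and drops out of <math>\displaystyle r</math>.<br />

```matlab
% Metropolis-Hastings within Gibbs for a bivariate normal target (illustrative sketch).
% Assumptions: zero means, unit variances, correlation rho; random-walk proposals.
rho  = 0.5;
f    = @(x, y) exp(-(x^2 - 2*rho*x*y + y^2) / (2*(1 - rho^2)));  % unnormalized density
step = 1.0;                          % proposal standard deviation (hypothetical tuning value)
n    = 5000;
X = zeros(n, 1);  Y = zeros(n, 1);   % the chain starts at (0, 0)
for k = 1:n-1
    % MH step for X with Y fixed; the proposal is symmetric, so the q-ratio is 1
    Z = X(k) + step*randn;
    r = min(1, f(Z, Y(k)) / f(X(k), Y(k)));
    if rand < r
        X(k+1) = Z;
    else
        X(k+1) = X(k);
    end
    % MH step for Y with the updated X fixed
    Z = Y(k) + step*randn;
    r = min(1, f(X(k+1), Z) / f(X(k+1), Y(k)));
    if rand < r
        Y(k+1) = Z;
    else
        Y(k+1) = Y(k);
    end
end
% plot(X(1000:end), Y(1000:end), '.')   % discard the first samples as burn-in
```

With a non-symmetric proposal the factor <math>\displaystyle q(X_n\mid Z)/q(Z\mid X_n)</math> would have to be kept in the acceptance ratio.<br />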
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ d \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero as long as <math>\displaystyle d < 1</math>, since every page receives the constant <math>\displaystyle 1-d</math>. We weight the sum and the constant using <math>\displaystyle d </math> (which is just a coefficient between 0 and 1 used to balance the two terms).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we may fix its scale. Summing the rank equation over all pages (and assuming every page has at least one outgoing link, so that each column of <math>\displaystyle L \times D^{-1}</math> sums to 1) shows that for <math>\displaystyle d < 1</math> the ranks satisfy<br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
that is, the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P / N = 1</math>). We can use this to solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector of the matrix A corresponding to eigenvalue 1. Secondly, we can recognize that P is proportional to the stationary distribution of a Markov chain with transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
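<br />
A minimal MATLAB sketch of this iteration on a made-up web of four pages is given below. The link matrix <code>L</code> and the value <code>d = 0.85</code> are hypothetical choices for illustration. The sketch assumes that every page has at least one outgoing link and uses the scaling in which the ranks sum to <math>\displaystyle N</math> (average rank 1), so the constant term enters A as <math>\displaystyle (1-d) \times e \times e^T / N</math>; with this choice every column of A sums to 1.<br />

```matlab
% Power iteration for page ranks on a small hypothetical web of 4 pages.
% L(i,j) = 1 if page j links to page i.
L = [0 0 1 0;
     1 0 0 0;
     1 1 0 1;
     0 1 0 0];
N = size(L, 1);
d = 0.85;                          % weighting coefficient (assumed value)
c = sum(L, 1);                     % c(j) = number of outgoing links from page j
e = ones(N, 1);
A = (1 - d) * (e * e') / N + d * L * diag(1 ./ c);
P = e;                             % initial guess with e' * P = N
for k = 1:200
    P = A * P;                     % repeatedly apply P <- A * P
end
disp(P')                           % the ranks; their sum stays equal to N
disp(norm(A * P - P))              % close to 0 once the iteration has converged
```

Page 3 receives links from three pages in this example, so it should come out with the largest rank.<br />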
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that such a matrix is more commonly called an ''orthogonal matrix''; the two terms refer to the same thing.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
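<br />
These statements can be checked numerically. The sketch below uses a hypothetical set of three vectors in <math>\mathbf{R}^2</math> (so p = 3 > n = 2), stored as the columns of a matrix; <code>rank</code> and <code>null</code> are standard MATLAB functions.<br />

```matlab
% Three vectors in R^2, stored as columns (p = 3 > n = 2), hypothetical example.
V = [1 0 2;
     0 1 3];
disp(rank(V))        % at most 2, so the three columns cannot be linearly independent
a = null(V);         % non-trivial coefficients a with V*a = 0
disp(a')
disp(norm(V * a))    % close to 0
```

Here the third column equals 2 times the first plus 3 times the second, which is the dependence that <code>null</code> recovers (up to scaling).<br />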
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues <math>\lambda_1, \dots, \lambda_n</math> of the matrix (counted with multiplicity).<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>,then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), <math>A^{\dagger} = (A^TA)^{-1}A^T</math> and <math>A^{\dagger}A = I</math>.<br />
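<br />
As a quick check of the formula, the sketch below builds a small non-square matrix with linearly independent columns (a hypothetical example) and compares <math>(A^TA)^{-1}A^T</math> with MATLAB's built-in <code>pinv</code>.<br />

```matlab
A = [1 0;
     1 1;
     1 2];                        % 3 x 2, not square, independent columns
Adag = (A' * A) \ A';             % (A^T A)^{-1} A^T without forming the inverse explicitly
disp(norm(Adag * A - eye(2)))     % close to 0: Adag is a left inverse of A
disp(norm(Adag - pinv(A)))        % close to 0: agrees with the built-in pseudo-inverse
```

Note that <code>A * Adag</code> is generally not the identity here; it is the projection onto the column space of A.<br />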
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given a matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> (the corresponding ''eigenvalue'') such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
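<br />
These properties can be verified numerically with MATLAB's <code>eig</code>. The matrix below is a hypothetical example chosen to be real, symmetric and positive-definite.<br />

```matlab
A = [2 1;
     1 3];                            % real, symmetric, positive-definite example
[V, D] = eig(A);                      % columns of V are eigenvectors, diag(D) the eigenvalues
lambda = diag(D);
disp(norm(A*V - V*D))                 % close to 0: A*v = lambda*v for every column v
disp(abs(sum(lambda) - trace(A)))     % close to 0: the trace is the sum of the eigenvalues
disp(abs(V(:,1)' * V(:,2)))           % close to 0: the eigenvectors are orthogonal
disp(all(lambda > 0))                 % displays 1: all eigenvalues are positive
```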
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix.<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^TA^TAx} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
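<br />
A 2-D rotation is a simple instance of an orthonormal transformation. The sketch below (with an arbitrarily chosen angle and test vector) checks that <math>AA^T=A^TA=I</math> and that the length of a vector is preserved.<br />

```matlab
theta = pi/6;                                   % rotation angle (arbitrary choice)
A = [cos(theta) -sin(theta);
     sin(theta)  cos(theta)];
x = [3; 4];
y = A * x;
disp(norm(A*A' - eye(2)))     % close to 0
disp(norm(A'*A - eye(2)))     % close to 0
disp([norm(x) norm(y)])       % both equal 5 up to rounding
```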
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
PCA keeps two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> for any <math>c > 0</math>, so without loss of generality we take <math>\textbf{w}</math> to have unit length:<br><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, so that<br />
<br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of the projection <math>\displaystyle u </math> obtained with the weight vector <math>\textbf{w}</math>.<br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint. The problem then becomes,<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded on the unit sphere, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (all partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point at which a contour of <math>\displaystyle f</math> touches the constraint curve <math>\displaystyle g(x,y) = 0</math> but does not cross it. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute each of them into <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>, where <math>\displaystyle f = \sqrt{2}</math>.<br />
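<br />
For completeness, the algebra behind these stationary points can be written out as follows, working directly from the Lagrangian <math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math> (a sketch of the intermediate steps, which are not spelled out in the lecture):<br />
<br />
<math>\begin{align}
1 - 2\lambda x = 0 &\Rightarrow x = \frac{1}{2\lambda} \\
-1 - 2\lambda y = 0 &\Rightarrow y = -\frac{1}{2\lambda} \\
x^2 + y^2 = 1 &\Rightarrow \frac{1}{2\lambda^2} = 1 \Rightarrow \lambda = \pm\frac{\sqrt{2}}{2}
\end{align}</math><br />
<br />
Taking <math>\displaystyle \lambda = \sqrt{2}/2</math> gives <math>\displaystyle (x,y) = (\sqrt{2}/2,-\sqrt{2}/2)</math> with <math>\displaystyle f = \sqrt{2}</math>, while <math>\displaystyle \lambda = -\sqrt{2}/2</math> gives <math>\displaystyle (x,y) = (-\sqrt{2}/2,\sqrt{2}/2)</math> with <math>\displaystyle f = -\sqrt{2}</math>.<br />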
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector (<math> \textbf{w}^T \textbf{w} = 1 </math>) then the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term is non-zero; setting <math>\displaystyle \partial L / \partial \lambda = 0</math> recovers the constraint <math> \textbf{w}^T \textbf{w} =1 </math>, so every stationary point of <math>\displaystyle L(\textbf{w},\lambda)</math> satisfies it.<br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2Sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible). Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D-dimensional data will have D eigenvectors, with eigenvalues<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
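<br />
The eigenvalue result and the variance decomposition can be checked on synthetic data. The sketch below is only an illustration: the data are randomly generated, observations are assumed to be stored one per column, and all variable names are made up for this example.<br />

```matlab
% Synthetic data: n observations of a D-dimensional vector, one per column.
D = 3;  n = 500;
X  = [3 0 0; 1 2 0; 0 1 0.5] * randn(D, n);
Xc = X - mean(X, 2) * ones(1, n);        % subtract the mean of each variable
S  = (Xc * Xc') / (n - 1);               % sample covariance matrix (D x D)
S  = (S + S') / 2;                       % enforce exact symmetry before calling eig
[W, Lam] = eig(S);
[lambda, idx] = sort(diag(Lam), 'descend');
W  = W(:, idx);                          % W(:,1) is the first principal component
u1 = W(:, 1)' * Xc;                      % projections of the data onto the first PC
disp([var(u1) lambda(1)])                % the two numbers agree
disp([sum(lambda) trace(S)])             % the total variance is preserved
```

The first comparison illustrates <math>\displaystyle \textbf{w}^{*T} S\textbf{w}^* = \lambda^*</math>, and the second illustrates <math>\displaystyle \sum_{i = 1}^D \lambda_i = Tr(S)</math>.<br />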
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')   % loads the data matrix X<br />
 who                                   % list the variables that were loaded<br />
 size(X)                               % one image per column<br />
 imagesc(reshape(X(:,1),20,28)')       % display a noisy image<br />
 colormap gray<br />
 [u, s, v] = svd(X);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
 colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (images etc.). There are generally two ways to do this. We can classify the data (i.e. the data come with labels and we learn to assign labels to new data) or cluster the data (i.e. the data are unlabeled and we look for groups of similar points).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
The MATLAB code is as follows.<br />
 load 2_3   % the data matrix X in this file is 64 X 400<br />
 [coefs, scores] = princomp(X');<br />
 % performs principal components analysis on the data matrix X'<br />
 % returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
 plot(scores(:,1),scores(:,2),'.')<br />
 % plots the scores on the first principal component on the x-axis<br />
 % and the scores on the second on the y-axis<br />
 plot(scores(1:200,1),scores(1:200,2),'.')<br />
 % same graph as above, only with the 2s (not 3s)<br />
 hold on   % this command allows us to add to the current plot<br />
 plot(scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
 % the 'ro' option makes them red Os on the plot<br />
 % If we classify based on the position in this plot (feature),<br />
 % it's easier than looking at each of the 64 data pieces<br />
 gname()   % displays a figure window and<br />
 % waits for you to press a mouse button or a keyboard key<br />
 figure<br />
 subplot(1,2,1)<br />
 imagesc(reshape(X(:,45),8,8)')<br />
 subplot(1,2,2)<br />
 imagesc(reshape(X(:,93),8,8)')<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
::#The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{D \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{d \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
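<br />
The encoding and reconstruction steps in the notes above can be sketched for the smallest possible case, D = 2 and d = 1. This is only an illustration in Python (not the Matlab used in these notes): the function names <code>top_direction</code>, <code>encode</code> and <code>reconstruct</code> are hypothetical, the toy points are made up, and power iteration is swapped in for a full eigendecomposition.<br />
<br />
```python
import math

def top_direction(points, iters=100):
    """Mean and leading principal direction of 2-D points (power iteration)."""
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    # Subtract the mean so that the centred data has mean 0.
    centred = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    # Entries of the 2x2 sample covariance matrix of the centred data.
    sxx = sum(x * x for x, _ in centred) / n
    sxy = sum(x * y for x, y in centred) / n
    syy = sum(y * y for _, y in centred) / n
    u = (1.0, 1.0)  # starting vector, assumed not orthogonal to the answer
    for _ in range(iters):
        v = (sxx * u[0] + sxy * u[1], sxy * u[0] + syy * u[1])
        norm = math.hypot(v[0], v[1])
        u = (v[0] / norm, v[1] / norm)
    return mean, u

def encode(p, mean, u):
    """y = u^T (x - mean): one coordinate per data point."""
    return u[0] * (p[0] - mean[0]) + u[1] * (p[1] - mean[1])

def reconstruct(y, mean, u):
    """x_hat = mean + u * y."""
    return (mean[0] + u[0] * y, mean[1] + u[1] * y)

# Toy data lying exactly on the line y = 2x, so one dimension loses nothing.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mean, u = top_direction(points)
codes = [encode(p, mean, u) for p in points]
recon = [reconstruct(y, mean, u) for y in codes]
```

Because the toy points are collinear, the reconstruction from a single coordinate matches the original points up to rounding; with noisy data the discarded direction would carry the (small) residual.<br />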
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data onto a lower dimensional space. The difference is that we are not interested in maximizing variances. Rather our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> : within classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means after projection====<br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, our original problem is equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br />
Let <math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> : between classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
(1) <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
(2) <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br />
<math>\displaystyle [(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) ] / [(w^T(\Sigma_1 + \Sigma_2)w)] </math><br />
<br />
or<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)/(w^Ts_ww)</math><br />
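<br />
To get a feel for this ratio, it can be evaluated for a few candidate directions. The sketch below is in Python (not the Matlab used in these notes); the function name <code>fisher_ratio</code> and the two-dimensional means and within-class covariance are made-up values for illustration only.<br />
<br />
```python
def fisher_ratio(w, mu1, mu2, s_w):
    """Return (w^T s_B w) / (w^T s_w w) for two classes in 2-D."""
    diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])
    # w^T s_B w = (w^T (mu1 - mu2))^2
    between = (w[0] * diff[0] + w[1] * diff[1]) ** 2
    # w^T s_w w, written out for a 2x2 matrix
    within = (w[0] * (s_w[0][0] * w[0] + s_w[0][1] * w[1])
              + w[1] * (s_w[1][0] * w[0] + s_w[1][1] * w[1]))
    return between / within

mu1, mu2 = (0.0, 0.0), (2.0, 0.0)      # hypothetical class means
s_w = [[1.0, 0.0], [0.0, 4.0]]         # hypothetical within-class covariance

j_x = fisher_ratio((1.0, 0.0), mu1, mu2, s_w)       # project onto the x-axis
j_y = fisher_ratio((0.0, 1.0), mu1, mu2, s_w)       # project onto the y-axis
j_scaled = fisher_ratio((5.0, 0.0), mu1, mu2, s_w)  # same direction, rescaled
```

In this toy setting the x-axis separates the means and gives the larger ratio, the y-axis gives a ratio of zero, and rescaling <code>w</code> leaves the ratio unchanged, since only the direction matters.<br />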
<br />
<br />
<br />
This is a very famous problem. We can solve this using Lagrange Multipliers. Since <math>w</math> is a direction vector, we do not care about its magnitude. Therefore we can solve the following constrained optimization problem<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
<br />Subject To: <br />
<math>\displaystyle (w^T(\Sigma_1 + \Sigma_2)w)=1 </math> or <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
<br />
<br />
Therefore, the function that we want to maximize is<br />
<br />
<math>\displaystyle (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) - \lambda * [(w^T(\Sigma_1 + \Sigma_2)w)-1] </math><br />
<br />
or <br />
<br />
<math>\displaystyle (w^Ts_Bw) - \lambda * [(w^Ts_ww)-1] </math><br />
<br />
<br />
<br />
<br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Example: FDA<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we may wish to classify an image as a chair, a desk, or a person. The discrete random variable, <math>Y</math>, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable <math>X</math> is the image.<br />
<br />
Consider independent and identically distributed data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> where <math> \displaystyle x_i \in X \subset \mathbb{R}^d </math> and <math> y_i \in Y</math> and Y is a finite set of discrete values. The data points <math> \displaystyle (x_1,y_1),...,(x_N,y_N) </math> are called the ''training set.''<br />
<br />
Then <math>h(x)</math> is a classifier where, given a new data point <math> \displaystyle x</math>, <math> \displaystyle h(x)</math> predicts <math> \displaystyle y</math>. The function <math> \displaystyle h</math> is found using the training set. i.e. the set ''trains'' <math> \displaystyle h</math> to map <math> \displaystyle X</math> to <math> \displaystyle Y</math>:<br />
<br />
<math>\, y = h(x)</math><br />
<br />
<math>\, h: X \to Y</math><br />
<br />
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(x)\neq y) </math><br />
<br />
Given test points, how can we find the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (y_i \neq h(x_i)) </math><br />
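<br />
The counting rule above translates directly into code. The following is a minimal Python sketch (not the Matlab used in these notes); the threshold classifier <code>h</code> and the five test points are hypothetical and serve only to show the arithmetic.<br />
<br />
```python
def empirical_error(h, xs, ys):
    """Fraction of test points (x_i, y_i) for which h(x_i) differs from y_i."""
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

def h(x):
    """Hypothetical classifier: predict class 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0

xs = [0.1, 0.4, 0.6, 0.9, 0.55]   # made-up test inputs
ys = [0, 1, 1, 1, 0]              # made-up true labels
err = empirical_error(h, xs, ys)  # two of the five points are misclassified
```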
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math><br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classifier function<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & o/w\\<br />
\end{cases}</math><br />
<br />
This function is considered the best classifier in terms of error rate. <br />
<br />
A problem is that we do not know the joint and marginal probability distributions to calculate <math>\, r(x)</math><br />
<br />
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques.<br />
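<br />
When the class-conditional probabilities and the prior are known (which, as noted above, is rarely the case in practice), the rule can be computed directly. The Python sketch below (not the Matlab used in these notes) assumes a made-up binary feature with invented likelihoods; the names <code>posterior</code> and <code>bayes_classifier</code> are hypothetical.<br />
<br />
```python
def posterior(x, lik1, lik0, prior1):
    """r(x) = Pr(Y=1 | X=x) computed with Bayes' theorem."""
    num = lik1(x) * prior1
    den = num + lik0(x) * (1.0 - prior1)
    return num / den

def bayes_classifier(x, lik1, lik0, prior1):
    """h(x) = 1 if r(x) >= 1/2, and 0 otherwise."""
    return 1 if posterior(x, lik1, lik0, prior1) >= 0.5 else 0

# Hypothetical likelihoods Pr(X=x | Y=1) and Pr(X=x | Y=0) for x in {0, 1}.
lik1 = lambda x: {0: 0.2, 1: 0.8}[x]
lik0 = lambda x: {0: 0.7, 1: 0.3}[x]
prior1 = 0.5                       # assumed Pr(Y=1)

r1 = posterior(1, lik1, lik0, prior1)
labels = [bayes_classifier(x, lik1, lik0, prior1) for x in (0, 1)]
```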
<br />
One such technique is Decision Boundary<br />
<br />
=== Decision Boundary ===</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=schedule&diff=3243schedule2009-07-19T22:38:15Z<p>Hclam: </p>
<hr />
<div>{| class="wikitable"<br />
<br />
{| border="1" cellpadding="2"<br />
|-<br />
|width="100pt"|Date<br />
|width="200pt"|Name<br />
|-<br />
|May 12 || Jeremy Sharpe<br />
|-<br />
|May 14 || Keith, Ho Chi Lam <br />
|-<br />
|May 19 || Mathieu Zerter <br />
|-<br />
|May 21 || Jeff Li <br />
|-<br />
|May 26 || Your name <br />
|-<br />
|May 28 || Laura Chelaru <br />
|-<br />
|June 2 || Timothy Choy <br />
|-<br />
|June 4 || Wenjing Zhao<br />
|-<br />
|June 9 || Mark Stuart <br />
|-<br />
|June 11 || Alberto Carignano <br />
|-<br />
|June 16 || Jeff Siswanto<br />
|-<br />
|June 18|| Iulia Pargaru <br />
|-<br />
|June 23 || NO WIKI COURSE NOTES - ASSIGNMENT 2 REVIEW<br />
|-<br />
|June 25|| Raghav Malik <br />
|-<br />
|June 30|| Alexandra Florescu <br />
|-<br />
|July 2|| Jon Walsh<br />
|-<br />
|July 7|| Laura Chelaru, Alexandra Florescu, Tyler Hargrave<br />
|-<br />
|July 9|| Luke Schaeffer<br />
|-<br />
|July 14|| Janm Mehta, Timothy Choy<br />
|-<br />
|July 16|| Tyler Hargrave<br />
|-<br />
|July 21|| Raghav Malik, Iulia Pargaru, Wenjing Zhao, Jon Walsh, Alberto Carignano, Jeff Li<br />
|-<br />
|July 23|| Timothy Choy, Mathieu Zerter, Keith (Ho Chi) Lam<br />
|-<br />
|July 28|| Luke Schaeffer and ...<br />
|}</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3228stat341 / CM 3612009-07-18T20:55:23Z<p>Hclam: /* Lagrange Multiplier */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. A good way to generate random numbers in computational statistics involves analyzing various distributions using computational methods. As a result, the probability distribution of each possible number appears to be uniform (pseudo-random). Outside a computational setting, presenting a uniform distribution is fairly easy (for example, rolling a fair die repetitively to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way; those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
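<br />
Both worked examples can be reproduced with a few lines of code. This is a Python sketch (not the Matlab used in these notes); the function names <code>lcg</code> and <code>period</code> are hypothetical, and <code>period</code> simply counts how many steps pass before the sequence revisits a value.<br />
<br />
```python
def lcg(a, b, m, seed, count):
    """Return [x_0, x_1, ..., x_count] with x_{i+1} = (a * x_i + b) mod m."""
    xs = [seed]
    for _ in range(count):
        xs.append((a * xs[-1] + b) % m)
    return xs

def period(a, b, m, seed):
    """Length of the cycle the sequence eventually falls into."""
    seen = {}          # value -> index at which it first appeared
    x, i = seed, 0
    while x not in seen:
        seen[x] = i
        x = (a * x + b) % m
        i += 1
    return i - seen[x]

good = lcg(13, 0, 31, 1, 3)        # first example: 1, 13, 14, 27
bad = lcg(3, 2, 4, 1, 3)           # second example: stuck at 1
good_period = period(13, 0, 31, 1)
bad_period = period(3, 2, 4, 1)
```

With ''a'' = 13 and ''m'' = 31 the sequence visits all 30 nonzero residues before repeating, whereas the second parameter choice has period 1.<br />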
<br />
For an ideal situation, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to equal m-1. In practice, Park and Miller (1988) found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
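<br />
As a rough check of these parameter values, the generator can be run and its scaled output averaged. The Python sketch below (not the Matlab used in these notes) uses a hypothetical function name <code>park_miller</code>, an arbitrary seed of 1, and the scaling by ''m - 1'' described earlier.<br />
<br />
```python
def park_miller(seed, count):
    """Scaled outputs of x_{i+1} = 7^5 * x_i mod (2^31 - 1)."""
    a, m = 7 ** 5, 2 ** 31 - 1
    x, out = seed, []
    for _ in range(count):
        x = (a * x) % m
        out.append(x / (m - 1))    # values fall in (0, 1]
    return out

us = park_miller(1, 10000)         # seed 1 is an arbitrary choice
mean_u = sum(us) / len(us)         # should be close to 1/2 for uniform output
```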
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative distribution function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then the distribution of the random variable X is given by F(x).<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
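:<math> F^{-1}(u) = \frac{-\log(1-u)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also uniformly distributed on [0, 1], we can equivalently use <math>\frac{-\log(u)}{\theta}</math>.<br />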
<br />
Now we can generate our random sample <math>x_1, \dots, x_n</math> from <math>f(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
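<br />
The two steps can be sketched in Python (not the Matlab used in these notes). The function name <code>sample_exponential</code>, the seed, and the value <math>\theta = 2</math> are arbitrary choices for illustration; the sample mean is compared with the known mean <math>1/\theta</math>.<br />
<br />
```python
import math
import random

def sample_exponential(theta, n, rng):
    """Draw n values via x = -log(1 - u) / theta with u ~ Unif[0, 1)."""
    # 1 - u lies in (0, 1], so the logarithm is always finite.
    return [-math.log(1.0 - rng.random()) / theta for _ in range(n)]

rng = random.Random(0)                      # fixed seed for repeatability
xs = sample_exponential(2.0, 100000, rng)
sample_mean = sum(xs) / len(xs)             # should be near 1 / theta = 0.5
```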
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows an exponential pattern as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math> and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math>. Further, for some distributions it is not even possible to find <math>F(x)</math> (i.e. a closed form expression for the distribution function, or otherwise; even if the closed form expression exists, it's usually difficult to find <math>F^{-1}</math>).<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the strict less-than signs on the upper bounds given in class have been changed to less-than-or-equal-to signs, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
for i=1:1000;<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
     elseif u <= sum(p(1:2))<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
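<br />
The Matlab loop above hard-codes three outcomes. A version that works for any finite list of values and probabilities might look like the following Python sketch; the function name <code>sample_discrete</code> and the seed are hypothetical choices, and the example reuses the distribution from class.<br />
<br />
```python
import random
from itertools import accumulate

def sample_discrete(values, probs, rng):
    """Return values[k] for the first k with u <= p_0 + ... + p_k."""
    u = rng.random()
    for value, cum in zip(values, accumulate(probs)):
        if u <= cum:
            return value
    return values[-1]   # guard against round-off in the cumulative sum

rng = random.Random(1)  # fixed seed for repeatability
draws = [sample_discrete([0, 1, 2], [0.3, 0.2, 0.5], rng) for _ in range(100000)]
freq = {v: draws.count(v) / len(draws) for v in (0, 1, 2)}
```

The observed frequencies in <code>freq</code> should be close to 0.3, 0.2 and 0.5.<br />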
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c</math> such that <math>c \cdot g(x) \geq f(x)\ \forall x</math>, we can draw candidates from <math>g(x)</math> and accept each candidate <math>x</math> with probability <math> \frac {f(x)}{c \cdot g(x)} </math>; the accepted candidates form a sample that follows the target distribution <math>f(x)</math>. Candidates are likely to be accepted where this ratio is close to 1 and likely to be rejected where it is close to 0.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7; sampling from <math> c \cdot g(x)</math> will yield a sample that follows the target distribution <math>f(x)</math><br />
<br />
At x=9; we will reject samples according to the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> after sampling from <math> c \cdot g(x)</math><br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
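<br />
The procedure above can be written as a generic routine. The Python sketch below (not the Matlab used in these notes) uses a hypothetical function name <code>accept_reject</code> and, purely as an illustration, the target density <math>f(x) = 3x^2</math> on [0, 1] with a uniform proposal and <math>c = 3</math>. It also records the acceptance rate, which the proof above says should be about <math>1/c</math>.<br />
<br />
```python
import random

def accept_reject(f, g_pdf, g_sample, c, n, rng):
    """Return n accepted samples and the observed acceptance rate."""
    samples, proposals = [], 0
    while len(samples) < n:
        y = g_sample(rng)                    # step 1: Y ~ g
        u = rng.random()                     # step 2: U ~ Unif[0, 1)
        proposals += 1
        if u <= f(y) / (c * g_pdf(y)):       # step 3: accept or reject
            samples.append(y)
    return samples, n / proposals

rng = random.Random(2)                       # fixed seed for repeatability
xs, rate = accept_reject(
    f=lambda x: 3.0 * x * x,                 # illustrative target density
    g_pdf=lambda x: 1.0,                     # Unif[0, 1] proposal density
    g_sample=lambda r: r.random(),
    c=3.0,                                   # max of f(x) / g(x) on [0, 1]
    n=30000,
    rng=rng,
)
mean_x = sum(xs) / len(xs)                   # the target has mean 3/4
```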
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
 j = 1;<br />
 while j <= 1000<br />
   y = rand;       % proposal drawn from Unif[0,1]<br />
   u = rand;<br />
   if u <= y       % accept with probability f(y)/(c*g(y)) = y<br />
     x(j) = y;<br />
     j = j + 1;<br />
   end<br />
 end<br />
hist(x);<br />
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
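<br />
The discrete example above can be simulated as follows. This is a Python sketch (not the Matlab used in these notes); the function name <code>sample_pmf</code> and the seed are hypothetical, while the probabilities and the constant <math>c</math> come from the example.<br />
<br />
```python
import random

pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}   # target pmf from the example
g = 0.25                                      # discrete uniform proposal on {1,2,3,4}
c = max(pmf.values()) / g                     # 0.55 / 0.25 = 2.2

def sample_pmf(rng):
    """Propose Y uniformly on {1,2,3,4}; accept with probability f(Y) / (c * g)."""
    while True:
        y = rng.randint(1, 4)                 # both endpoints are included
        if rng.random() <= pmf[y] / (c * g):
            return y

rng = random.Random(3)                        # fixed seed for repeatability
draws = [sample_pmf(rng) for _ in range(100000)]
freq = {v: draws.count(v) / len(draws) for v in pmf}
```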
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
It appears that if we want to sample from the Gamma distribution, we can consider sampling from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
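<br />
The same procedure for a general <math>\lambda</math> can be sketched in Python (not the Matlab used in these notes). The function name <code>sample_gamma</code>, the seed, and the values <math>t = 5, \lambda = 2</math> are arbitrary; the sample mean is compared with the known mean <math>t/\lambda</math>.<br />
<br />
```python
import math
import random

def sample_gamma(t, lam, rng):
    """One Gamma(t, lam) draw as -(1/lam) * log(u_1 * ... * u_t), integer t."""
    prod = 1.0
    for _ in range(t):
        prod *= 1.0 - rng.random()    # each factor lies in (0, 1]
    return -math.log(prod) / lam

rng = random.Random(5)                # fixed seed for repeatability
xs = [sample_gamma(5, 2.0, rng) for _ in range(50000)]
mean_x = sum(xs) / len(xs)            # should be near t / lam = 2.5
```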
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far, eg. the inverse is not easy to obtain from F(Z); we may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point on the Cartesian plane, r is the distance from that point to the pole, and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi} \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so <math>d</math> and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
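<br />
Besides looking at histograms, the output can be checked numerically. The Python sketch below (not the Matlab used in these notes) uses a hypothetical function name <code>box_muller</code> and an arbitrary seed, and computes the sample mean, variance and covariance of the generated pairs, which should be near 0, 1 and 0 respectively.<br />
<br />
```python
import math
import random

def box_muller(rng):
    """Turn two uniform draws into a pair of standard normal draws."""
    u1 = 1.0 - rng.random()            # in (0, 1], so log(u1) is finite
    u2 = rng.random()
    d = -2.0 * math.log(u1)            # d = r^2, exponential with mean 2
    theta = 2.0 * math.pi * u2         # uniform on [0, 2*pi)
    r = math.sqrt(d)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(4)                 # fixed seed for repeatability
pairs = [box_muller(rng) for _ in range(50000)]
n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
```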
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ i.i.d.} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>u=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
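<br />
As a quick check, the following sketch (using only base Matlab functions; the variable names R, S, errChol and errSqrtm are introduced here for illustration) confirms that both factors reproduce <math>\Sigma</math>: the Cholesky factor satisfies R'*R = sigma and the symmetric root satisfies S*S = sigma, which is why r*ss has covariance <math>\Sigma</math> in either case.<br />

```matlab
sigma = [1 1/2; 1/2 1];

R = chol(sigma);      % upper triangular factor, R'*R equals sigma
S = sqrtm(sigma);     % symmetric square root, S*S equals sigma

errChol  = norm(R'*R - sigma);   % floating-point error only
errSqrtm = norm(S*S  - sigma);   % floating-point error only
disp([errChol errSqrtm]);
```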
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming that <math>\sigma</math> is known, so that <math>\theta = \mu</math>, and that the prior on <math>\mu</math> is Gaussian:<br />
:<math>P(\mu) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\mu|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} \cdot \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\mu|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
The posterior mean therefore differs from the MLE: it is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
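<br />
The shrinkage of the posterior mean toward the prior mean can be seen numerically. The sketch below is only an illustration: the values of sigma, mu0, tau, muTrue and N are assumptions chosen for the example, not values from the lecture.<br />

```matlab
% Illustrative values (assumptions, not from the lecture)
sigma  = 2;     % known standard deviation of the data
mu0    = 0;     % prior mean
tau    = 1;     % prior standard deviation
muTrue = 3;     % mean used to simulate the data
N      = 50;    % sample size

x    = muTrue + sigma*randn(N,1);
xbar = mean(x);                                  % maximum likelihood estimate of mu

wData   = (N/sigma^2) / (N/sigma^2 + 1/tau^2);   % weight on the sample mean
wPrior  = (1/tau^2)   / (N/sigma^2 + 1/tau^2);   % weight on the prior mean
muTilde = wData*xbar + wPrior*mu0;               % posterior mean

disp([xbar muTilde]);
```

Increasing N moves wData toward 1, so muTilde approaches xbar, in line with the limit stated above.<br />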
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is hard to find a precise definition of the term "Monte Carlo method", the method is widely used and is described in numerous articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. A ''stochastic process'' is a random process whose future evolution is described by probability distributions (the counterpart of a deterministic process), and simulation methods built on such processes are known as ''Monte Carlo methods'' [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)] = I</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
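<br />
The process and the two statistics above can be sketched in Matlab as follows. The integrand h, the limits a and b, and the sample size N are assumptions chosen for illustration (the exact value of this integral is 8/3), and 1.96 is the usual approximation to <math>Z_{\alpha/2}</math> for a 95% interval.<br />

```matlab
h = @(x) x.^2;      % illustrative integrand (an assumption)
a = 0;  b = 2;      % integration limits; the exact value is 8/3
N = 100000;

x = a + (b-a)*rand(N,1);     % x_i ~ Unif(a,b)
Y = h(x)*(b-a);              % Y_i = w(x_i) = h(x_i)(b-a)

Ihat = mean(Y);              % Monte Carlo estimate of I
SE   = std(Y)/sqrt(N);       % std uses the N-1 denominator, matching s above
z    = 1.96;                 % approximate Z_{alpha/2} for alpha = 0.05
CI   = [Ihat - z*SE, Ihat + z*SE];

disp(Ihat);
disp(CI);
```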
<br />
'''Example 1'''<br />
<br />
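Find <math> E[\sqrt{X}]</math> where <math>X</math> has density <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x \geq 0</math><br />
<br />
# We need to draw <math>x_1, ..., x_N</math> from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math> (for example by the inverse transform <math>x = -\log(u)</math>, <math>u \sim~ Unif(0,1)</math>)<br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />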
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u = rand(100,1);              % u_i ~ Unif(0,1)<br />
x = -log(u);                  % inverse transform: x_i ~ Exp(1)<br />
h = x.^.5;                    % h(x_i) = sqrt(x_i)<br />
mean(h)                       % the value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);    % integrand sqrt(x)*f(x)<br />
quad(F,0,50)                  % the value of the integral computed numerically by Matlab<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# Exact value: <math> \displaystyle I = \left[ x^4/4 \right]_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);    % 1000 draws from Unif(0,1)<br />
mean(x.^3)           % elementwise cube, then average<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite range,<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math> and the limits <math>x=5</math>, <math>x \to \infty</math> become <math>y=1</math>, <math>y \to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
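<br />
A minimal Matlab sketch of this example follows. It takes the integrand on (0,1) as the positive quantity <math>3e^{-(1/y+4)}/y^2</math>, because reversing the limits of integration cancels the minus sign from <math>dy=-y^2dx</math>. The sample size N and the variable names are assumptions, and the closed form <math>3e^{-5}</math> of the original integral is included only for comparison.<br />

```matlab
N = 100000;
y = rand(N,1);                      % y_i ~ Unif(0,1)
w = 3*exp(-(1./y + 4)) ./ y.^2;     % integrand after the substitution

Ihat  = mean(w);                    % Monte Carlo estimate
exact = 3*exp(-5);                  % closed form of the original integral
disp([Ihat exact]);
```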
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial random variables with success probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(m, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution and the posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of <math>\displaystyle p_1</math> is proportional to the Binomial likelihood viewed as a function of <math>\displaystyle p_1</math>, that is <math>\displaystyle f(p_1|x) \propto p_1^x(1-p_1)^{n-x}</math>, which is the kernel of a Beta distribution (and similarly for <math>\displaystyle p_2</math>). Thus<br><br />
:<math>\displaystyle f(p_1 | X) \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle f(p_2 | Y) \sim \text{Beta}(y+1, m-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for Y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% interval for <math>\delta</math> (in this Bayesian setting, a credible interval) is given by the interval between the 2.5% and 97.5% quantiles, which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The 95% interval is approximately <math>(0.06606, 0.32204)</math>.<br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
Assuming a flat prior on each <math>\displaystyle p_i</math>:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math>.<br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta = x_j</math>, where <math>\displaystyle j</math> is the smallest index with <math>\displaystyle p_j \geq 0.5</math>.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, suppose that for each dose level <math>x_i</math>, <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling is a technique for estimating properties of a particular distribution using samples drawn from a different distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the random variables. The key idea in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others [Importance Sampling, Wikipedia]. The following diagram illustrates a Monte Carlo approximation for the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the proposal distribution to sample from and g(x) is the target function. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it is easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# Draw <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math> and compute <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the Law of Large Numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
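<br />
The three steps of the process can be sketched in Matlab as follows. The choices here are assumptions made for illustration only: f is taken to be the N(0,1) density (which is of course easy to sample directly, but has a known answer), g is the N(0,4) density, and <math>h(x)=x^2</math>, so that <math>I = E_f[X^2] = 1</math>.<br />

```matlab
f  = @(x) exp(-x.^2/2)/sqrt(2*pi);               % target density: N(0,1)
sg = 2;                                          % proposal standard deviation
g  = @(x) exp(-x.^2/(2*sg^2))/(sg*sqrt(2*pi));   % proposal density: N(0,sg^2)
h  = @(x) x.^2;                                  % so I = E_f[X^2] = 1

N = 100000;
x = sg*randn(N,1);             % x_i ~ g
w = h(x).*f(x)./g(x);          % w(x_i) = h(x_i) f(x_i) / g(x_i)

Ihat = mean(w);                % importance sampling estimate of I
disp(Ihat);
```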
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because each value sampled from <math>\displaystyle g(x)</math> is given an 'importance' or 'weighting' <math>\displaystyle \frac{f(x)}{g(x)}</math> that corrects for not having sampled from the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, taking each sampled value <math>\displaystyle h(x_i)</math> from this process and weighting more heavily the ones that are more likely under <math>\displaystyle f</math> than under <math>\displaystyle g</math> (i.e. the ones for which <math>\displaystyle \frac{f(x)}{g(x)}</math> is high).<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of <math> h(x_i) </math> instead of a plain average.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math> We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f > g</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> faster than <math>\displaystyle h^2(x)f^2(x) \rightarrow 0 </math>, then the integrand <math>\frac{h^2(x)f^2(x)}{g(x)} </math> grows without bound and <math>\displaystyle E[(w(x))^2] </math> can be infinite. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
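<br />
As a worked illustration of this remark (the Gaussian family is an assumption chosen because the integral has a closed form; it is not an example from the lecture), take <math>\displaystyle h(x)=1</math>, let <math>\displaystyle f</math> be the <math>\displaystyle N(0,1)</math> density and let <math>\displaystyle g</math> be the <math>\displaystyle N(0,s^2)</math> density. Then<br />
::<math>\displaystyle E_g[(w(x))^2] = \int \frac{f^2(x)}{g(x)}\,dx = \frac{s}{\sqrt{2\pi}}\int e^{-x^2\left(1-\frac{1}{2s^2}\right)}\,dx = \frac{s^2}{\sqrt{2s^2-1}} \quad \text{if } s^2 > \frac{1}{2}</math><br />
and the integral diverges if <math>\displaystyle s^2 \leq \frac{1}{2}</math>. So a proposal whose tail is too thin relative to <math>\displaystyle f</math> gives an infinite second moment, while any thicker-tailed proposal (<math>\displaystyle s \geq 1</math>) gives a finite one. At <math>\displaystyle s=1</math> the value is 1, which is consistent with <math>\displaystyle w(x)=1</math> having zero variance when <math>\displaystyle g=f</math>.<br />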
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g </math> is a density and so cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
'''Aside: Jensen's Inequality'''<br />
<br />
In this aside <math>\displaystyle g </math> denotes an arbitrary convex function, not the proposal density. If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>), then for <math>\displaystyle 0 \leq \alpha \leq 1 </math>, <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
Essentially, the definition of convexity says that the line segment between two points on a curve lies on or above the curve. This can then be generalized to any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0 </math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
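Therefore, applying this with the convex function <math>\displaystyle t \mapsto t^2 </math> and the random variable <math>\displaystyle \left| w(x) \right| </math>,<br />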
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
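Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E_g[(w(x))^2]</math>, then <math>\displaystyle \left| w(x) \right| = \frac{\left| h(x) \right| f(x)}{g^*(x)} = \int \left| h(s) \right| f(s) ds</math> is constant, so <math>\displaystyle E_{g^*}[(w(x))^2] = \left(\int \left| h(s) \right| f(s) ds \right)^2</math> and the lower bound is attained, as required.<br /><br />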
<br />
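However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, because its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is itself an integral of the same kind as the one we are trying to estimate.<br />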
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
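<br />
The theorem can be checked on a case where <math>\displaystyle g^*</math> happens to be available. The example below is an assumption chosen for illustration, not one from the lecture: with <math>\displaystyle f(x)=e^{-x}</math> and <math>\displaystyle h(x)=x</math> on <math>\displaystyle x>0</math>, we get <math>\displaystyle g^*(x) = xe^{-x}</math> (the normalizing integral equals 1), which is the Gamma(2,1) density and can be sampled as the sum of two independent Exp(1) draws. Every weight then equals <math>\displaystyle I = E_f[X] = 1</math>, so the estimator has zero variance.<br />

```matlab
N = 1000;
x = -log(rand(N,1)) - log(rand(N,1));   % sum of two Exp(1) draws ~ Gamma(2,1) = g*

h     = x;                 % h(x) = x
f     = exp(-x);           % Exp(1) density
gstar = x.*exp(-x);        % |h(x)| f(x) divided by its integral, which is 1

w = h.*f./gstar;           % every weight equals I = E_f[X] = 1
disp([mean(w) var(w)]);
```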
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since <math>\displaystyle \int f(x) dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
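<br />
A minimal sketch of this normalized form is given below. All specific choices are assumptions made for illustration: the target is known only as the unnormalized function <math>\displaystyle e^{-x^2/2}</math> (the N(0,1) density without its constant, called fTilde in the code), the proposal g is the N(0,4) density, and <math>\displaystyle h(x)=x^2</math>, so the answer should be close to 1.<br />

```matlab
fTilde = @(x) exp(-x.^2/2);                      % N(0,1) density without its constant
sg = 2;                                          % proposal standard deviation
g  = @(x) exp(-x.^2/(2*sg^2))/(sg*sqrt(2*pi));   % proposal density: N(0,sg^2)
h  = @(x) x.^2;                                  % E_f[X^2] = 1 for f = N(0,1)

N = 100000;
x = sg*randn(N,1);               % x_i ~ g
b = fTilde(x)./g(x);             % unnormalized weights b(x_i)
bNorm = b/sum(b);                % normalized weights b'(x_i), which sum to 1

Ihat = sum(bNorm.*h(x));         % normalized importance sampling estimate
disp(Ihat);
```

The unknown constant of f cancels in bNorm, which is why only fTilde is needed.<br />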
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to compute the proportion of observations that are greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is considered a low probability event (a tail event), and different runs may give significantly different values. For example: if we generate 1000 samples, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (each of the 100 runs generates 100 samples):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g</math> be the <math>N(4,1) </math> density, so that<br />
:<math>b(x) = \frac{f(x)}{g(x)} = \frac{e^{-x^2/2}}{e^{-(x-4)^2/2}} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and much smaller than the variance observed when using plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int^\ h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int^\ h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math>\displaystyle X_{t-1}</math>.<br />
<br />
In that case the joint density factorizes as<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>. And the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop (these two states are absorbing). The transition matrix that expresses this process is shown below, where row i holds the probabilities of moving from state i:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
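<br />
The transition matrix of this random walk can be built in Matlab as sketched below, with a step to the right having probability p and a step to the left probability q. The number of states n and the value of p are assumptions chosen for illustration.<br />

```matlab
n = 6;          % number of states (illustrative)
p = 0.6;        % probability of a step to the right
q = 1 - p;      % probability of a step to the left

P = zeros(n);
P(1,1) = 1;                 % the first state is absorbing
P(n,n) = 1;                 % the last state is absorbing
for i = 2:n-1
    P(i,i+1) = p;           % step right
    P(i,i-1) = q;           % step left
end

disp(P);
disp(sum(P,2)');            % each row sums to 1
```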
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a Monte Carlo average over the sequence can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math>, summed over all <math>j</math>, add up to 1: each row must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),\dots,P(X_n=x_N)]</math> <br /><br />
where the subscript of <math>x</math> denotes an ordered index of all <math>N</math> possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
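<br />
As a quick numerical illustration of the theorem, the Matlab sketch below propagates a starting pmf step by step and compares the result with <math>\mu_0P^n</math>. The three-state matrix, the starting vector and <math>n=5</math> are assumed values chosen only for the example.<br />
<br />
 P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];   % assumed transition matrix<br />
 mu0 = [1 0 0];                             % assumed start: state 1 with probability 1<br />
 n = 5;<br />
 mu = mu0;<br />
 for k = 1:n<br />
     mu = mu * P;                           % mu_k = mu_{k-1} P<br />
 end<br />
 disp(mu);                                  % marginal pmf after n steps<br />
 disp(mu0 * P^n);                           % the same vector, using P(n) = P^n<br />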
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means <math>j</math> is accessible from <math>i</math>: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain (or a set of states) is '''Irreducible''' if all of its states communicate with each other (i.e. it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only stay where they are or move to each other: their only nonzero transition probabilities are <math>\displaystyle P_{11}</math>, <math>\displaystyle P_{12}</math>, <math>\displaystyle P_{21}</math> and <math>\displaystyle P_{22}</math><br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class with only one element<br />
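<br />
The decomposition can also be found mechanically. The Matlab sketch below is one possible way to do it (not necessarily the method used in class): state i reaches state j if the (i,j) entry of <math>(I+P)^{N-1}</math> is positive, and two states communicate if each reaches the other. The variable names R and C are our own.<br />
<br />
 P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];<br />
 N = size(P,1);<br />
 R = (eye(N) + P)^(N-1) > 0;   % R(i,j) is true if i reaches j in at most N-1 steps<br />
 C = R & R';                   % C(i,j) is true if i and j communicate<br />
 for i = 1:N<br />
     fprintf('state %d communicates with: %s\n', i, mat2str(find(C(i,:))));<br />
 end<br />
<br />
For the matrix above this should list states 1 and 2 together, and states 3 and 4 each on their own, matching the three classes found by inspection.<br />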
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i</math> for some <math>\displaystyle n\geq 1 | x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, states 1 and 2 form a closed set, so they are both recurrent states. State 4 is an absorbing state, so it is also recurrent. State 3 is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. If we move n steps to the right or to the left, the only way to go back to <math>\displaystyle 0</math> is to make n steps in the opposite direction, so a return is only possible after <math>\displaystyle 2n</math> steps:<br />
<math>\displaystyle Pr(X_{2n}=0\mid X_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know if this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> (i.e. <math>p\neq \frac{1}{2}</math>) the terms decay geometrically and the state is transient; otherwise <math>\displaystyle 4pq = 1</math>, the terms decay only like <math>\frac{1}{\sqrt{n\pi}}</math>, the sum diverges, and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
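<br />
The two cases can be compared numerically. The Matlab sketch below accumulates partial sums of <math>P_{00}(2n)</math> using the ratio <math>\binom{2n}{n}/\binom{2n-2}{n-1}=\frac{2(2n-1)}{n}</math>, which avoids computing large binomial coefficients directly. The cutoff of 2000 terms and the value <math>p=0.6</math> are arbitrary choices for illustration.<br />
<br />
 N = 2000;                              % number of terms in the partial sum (assumed)<br />
 for p = [0.5 0.6]<br />
     q = 1 - p;<br />
     a = 1; s = 1;                      % a = P_00(0) = 1 is the n = 0 term<br />
     for n = 1:N<br />
         a = a * 2*(2*n-1)/n * p*q;     % P_00(2n) obtained from P_00(2n-2)<br />
         s = s + a;<br />
     end<br />
     fprintf('p = %.1f: partial sum = %.4f, 1/sqrt(1-4pq) = %.4f\n', p, s, 1/sqrt(1-4*p*q));<br />
 end<br />
<br />
For <math>p=0.6</math> the partial sum should settle near the closed form <math>\frac{1}{\sqrt{1-4pq}}=5</math>, while for <math>p=0.5</math> the closed form is infinite and the partial sum keeps growing as more terms are added.<br />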
<br />
====Convergence of Markov chain====<br />
We define the <math>\emph{recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i, in which case <math>\displaystyle T_{ij}=\infty</math>.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
min\{n\geq 1: x_{n}=j \mid x_{0}=i\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j\mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle n</math> if, starting from a state, a return to it is possible only after a multiple of <math>\displaystyle n</math> steps, where <math>\displaystyle n>1</math> is the largest number with this property.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic. From state 1 you can only move to state 2 or 4, and from those only to state 1 or 3, so you can get back to state 1 only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
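<br />
One way to check a period numerically is to take the greatest common divisor of all step counts n at which a return to the state has positive probability. The Matlab sketch below does this for state 1 of the chain above; looking only at the first 20 steps is an arbitrary cutoff that is enough for this small example.<br />
<br />
 P = [0 0.5 0 0.5; 0.5 0 0.5 0; 0 0.5 0 0.5; 0.5 0 0.5 0];<br />
 d = 0; Pn = eye(size(P,1));<br />
 for n = 1:20                  % assumed cutoff on the number of steps examined<br />
     Pn = Pn * P;              % Pn now holds P^n<br />
     if Pn(1,1) > 0            % a return to state 1 is possible in n steps<br />
         d = gcd(d, n);        % gcd(0,n) = n, so the first return time starts the gcd<br />
     end<br />
 end<br />
 disp(d);                      % period of state 1<br />
<br />
For this chain the result should be 2, and replacing P with the three-state cyclic matrix from the previous example should give 3.<br />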
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
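'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required. The second equality is detailed balance, and the last one holds because row <math>j</math> of <math>P</math> sums to 1.<br />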
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
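<br />
A short Matlab check of this warning, as a sketch: the variable names st and mu and the choice of starting in state 1 are our own. It confirms that the uniform vector is stationary and shows that the marginal pmf of the chain keeps cycling instead of settling down.<br />
<br />
 P = [0 1 0; 0 0 1; 1 0 0];<br />
 st = [1/3 1/3 1/3];<br />
 disp(st * P);                 % equals st, so st is a stationary distribution<br />
 mu = [1 0 0];                 % assumed start: state 1 with probability 1<br />
 for n = 1:6<br />
     mu = mu * P;              % marginal pmf after n steps<br />
     disp(mu);                 % cycles through the three unit vectors<br />
 end<br />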
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, a stationary distribution <math>\pi = [0.5 \ 0 \ 0.5]</math> means that, in the long run, the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
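<br />
The last statement can be tried out by simulation. The Matlab sketch below uses an assumed three-state chain whose stationary distribution is <math>(4/11, 4/11, 3/11)</math>, and an arbitrary bounded function <math>g(j)=j^2</math>; it compares the time average of g along one long path with the stationary expectation. The path length is also an arbitrary choice.<br />
<br />
 P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];   % assumed example chain<br />
 st = [4/11 4/11 3/11];                     % its stationary distribution<br />
 g = [1 4 9];                               % g(j) = j^2, an arbitrary bounded function<br />
 N = 100000; x = 1; total = 0;              % assumed path length and starting state<br />
 for n = 1:N<br />
     x = find(rand < cumsum(P(x,:)), 1);    % draw the next state from row x of P<br />
     total = total + g(x);<br />
 end<br />
 disp(total / N);                           % time average along the chain<br />
 disp(sum(g .* st));                        % stationary expectation of g<br />
<br />
The two printed numbers should be close, and closer for larger N.<br />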
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
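<br />
The same answer can be obtained numerically. The Matlab sketch below is one possible approach: it stacks the stationarity equations <math>(P^T-I)\pi^T=0</math> with the normalization <math>\sum_i\pi_i=1</math> and solves the resulting system, then compares with a high power of P. The variable names and the power 50 are our own choices.<br />
<br />
 P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];<br />
 N = size(P,1);<br />
 A = [P' - eye(N); ones(1,N)];   % stationarity equations plus the normalization row<br />
 b = [zeros(N,1); 1];<br />
 st = (A \ b)';                  % least-squares solve of this consistent system<br />
 disp(st);                       % compare with [4/11 4/11 3/11]<br />
 disp(P^50);                     % every row should be close to st<br />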
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math> (this is the starting point of the chain, chosen randomly) and set index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. The starting point of the algorithm, <math>X_0</math>, is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of Metropolis Hastings.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The probability of setting the mean to x and sampling y is equal to the probability of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
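 b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
 n=10000; % length of the chain<br />
 X=zeros(1,n); % preallocate the chain<br />
 X(1)=randn; % random starting point<br />
 for i=2:n<br />
     Y=b*randn+X(i-1); % propose Y from N(X(i-1),b^2); we want to decide whether we accept this Y<br />
     r=min( (1+X(i-1)^2)/(1+Y^2),1); % acceptance probability for the Cauchy target<br />
     u=rand;<br />
     if u<r<br />
         X(i)=Y; % accept Y<br />
     else<br />
         X(i)=X(i-1); % reject Y, remaining in the current state<br />
     end;<br />
 end;<br />
 hist(X,100); % histogram of the chain, to compare with the shape of f(x)<br />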
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> usually lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> would be very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>: the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: the shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
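<br />
Besides looking at histograms, the effect of b can be seen in the fraction of proposals that are accepted. The Matlab sketch below reruns the Cauchy example for the three values of b used above and prints the acceptance rate for each; the variable name accepted is our own addition.<br />
<br />
 n = 10000;                              % chain length, as in the example above<br />
 for b = [0.001 2 1000]<br />
     x = randn; accepted = 0;            % random starting point<br />
     for i = 2:n<br />
         y = b*randn + x;                % propose y from N(x,b^2)<br />
         if rand < min((1+x^2)/(1+y^2), 1)<br />
             x = y; accepted = accepted + 1;<br />
         end<br />
     end<br />
     fprintf('b = %g: acceptance rate = %.3f\n', b, accepted/(n-1));<br />
 end<br />
<br />
With a very small b almost every proposal should be accepted but the chain barely moves, and with a very large b almost every proposal should be rejected; neither extreme mixes well.<br />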
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We have talked about <math>\emph{discrete}</math> Markov chains so far. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution, that is, <math>\int f(y)P(y,x)dy=f(x)</math>.<br />
<br />
Starting from the right-hand side of the stationarity condition:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> by detailed balance<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where it equals 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is the probability of jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math>, which happens if the proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is the probability of jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math>, which happens if the proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math> from <math>\displaystyle x</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math> from <math>\displaystyle y</math><br />
:<math>\displaystyle r(x,y)</math> is the probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> is the probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
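<br />
The argument can also be checked numerically on a small discrete state space. In the Matlab sketch below, the target pmf f and the non-symmetric proposal matrix Q are made-up values chosen only for illustration. It builds the Metropolis Hastings transition matrix <math>P_{ij}=Q_{ij}\,r(i,j)</math> for <math>i\neq j</math>, puts the remaining probability on the diagonal, and tests detailed balance and stationarity.<br />
<br />
 f = [0.1 0.2 0.3 0.4];                  % assumed target pmf on 4 states<br />
 Q = [0.1 0.2 0.3 0.4; 0.25 0.25 0.25 0.25; 0.4 0.3 0.2 0.1; 0.25 0.25 0.25 0.25];   % assumed proposal, Q(i,j) = q(j|i)<br />
 N = length(f); P = zeros(N);<br />
 for i = 1:N<br />
     for j = 1:N<br />
         if i ~= j<br />
             r = min(f(j)*Q(j,i) / (f(i)*Q(i,j)), 1);   % acceptance probability r(i,j)<br />
             P(i,j) = Q(i,j) * r;                       % propose j and accept it<br />
         end<br />
     end<br />
     P(i,i) = 1 - sum(P(i,:));           % stay put when the proposal is rejected or equals i<br />
 end<br />
 F = diag(f) * P;                        % F(i,j) = f(i) P(i,j)<br />
 disp(max(max(abs(F - F'))));            % detailed balance: should be (numerically) zero<br />
 disp(max(abs(f*P - f)));                % stationarity: should be (numerically) zero<br />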
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999), whose 1953 algorithm is also the basis of the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although it is less widely applicable.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>, i.e. <math>\epsilon = Y-X_i \sim~ g </math>. Here <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is taken to be a function of the distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. Accept <math>\displaystyle Y </math> with probability <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In Matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant (the form of the distribution is known, but the distribution is not exactly known).<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
x_i, & \text{with probability } 1-r \\<br />
y, & \text{with probability } r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
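<br />
A minimal Matlab sketch of an independence sampler follows. The target shape <math>e^{-x^2/2}</math> (a standard normal known only up to its constant) and the standard Cauchy proposal are assumptions made for this illustration; they are not from the lecture. The Cauchy draws use the inverse cdf <math>\tan(\pi(u-\frac{1}{2}))</math>. Counting how often the chain repeats its previous state makes the dependence between consecutive values visible.<br />
<br />
 f = @(x) exp(-x.^2/2);                  % assumed target, known up to a constant<br />
 q = @(x) 1./(pi*(1+x.^2));              % assumed fixed proposal density (standard Cauchy)<br />
 n = 10000; X = zeros(1,n);<br />
 X(1) = 0; repeats = 0;<br />
 for i = 2:n<br />
     Y = tan(pi*(rand-0.5));             % Cauchy draw, independent of X(i-1)<br />
     r = min(f(Y)*q(X(i-1)) / (f(X(i-1))*q(Y)), 1);   % r still depends on X(i-1)<br />
     if rand < r<br />
         X(i) = Y;                       % accept the proposal<br />
     else<br />
         X(i) = X(i-1); repeats = repeats + 1;        % reject: repeat the current state<br />
     end<br />
 end<br />
 fprintf('fraction of repeated states: %.3f\n', repeats/(n-1));<br />
 fprintf('sample mean %.3f, sample variance %.3f\n', mean(X), var(X));<br />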
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. This is the same problem as <math>\displaystyle \max(e^{\frac{-h(x)}{T}})</math> for any constant T > 0 (since the exponential function is monotone increasing, minimizing <math>\displaystyle h(x) </math> is equivalent to maximizing <math>\displaystyle e^{\frac{-h(x)}{T}}</math>)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, set <math>\displaystyle i = i + 1</math>, and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math>; because T is large this probability is close to 1, so uphill moves are accepted often<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore almost surely reject <math>\displaystyle y </math><br />
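<br />
To make the effect of T concrete, the snippet below evaluates <math>\displaystyle r </math> for an uphill move with <math>\displaystyle h(y) - h(x) = 1 </math> at a few temperatures. The size of the move and the temperature values are illustrative choices only.<br />
<br />
```matlab
% Acceptance probability of an uphill move with h(y) - h(x) = 1
dh = 1;
T = [100 1 0.01];          % large, moderate and small temperature
r = min(1, exp(-dh ./ T))
% r is approximately [0.9900  0.3679  3.7e-44]
```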
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -2.75 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle \frac{f(x)}{T}</math> for <math>\displaystyle T = 100</math> and <math>\displaystyle T = 0.1</math> and observe that the distribution spreads out for large T and contracts for small T, i.e. T plays a role similar to a variance.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
Eventually T becomes very small, the distribution we are sampling from becomes sharply peaked, and the points we sample lie close to the local maximum of the exponential function (the mode of the distribution), which corresponds to the local minimum of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
obj = @(x) (x - 2).^2; %objective function to minimize<br />
<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2; %standard deviation of the proposal N(x, b^2)<br />
X(1) = 0; %starting point<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); %draw Y from N(X(i-1), b^2)<br />
r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) ); %f(Y)/f(X) as a single exponential, which avoids 0/0 when T is small<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
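<br />
The whole schedule can also be run in one loop. The following is a minimal sketch under stated assumptions: the temperatures are the ones quoted above, while the number of iterations per temperature (m = 100) and the variable names are choices made for this illustration.<br />
<br />
```matlab
% Simulated annealing for h(x) = (x - 2)^2 with a decreasing temperature schedule
h = @(x) (x - 2).^2;
Ts = [10 9 5 2 0.9 0.1 0.01 0.005 0.001];  % schedule quoted in the text
b = 2;                                     % proposal standard deviation
m = 100;                                   % iterations per temperature (assumed)
X = zeros(1, m*numel(Ts) + 1);             % X(1) = 0 is the starting point
k = 1;
for T = Ts
    for j = 1:m
        Y = X(k) + b*randn;                   % symmetric proposal N(X(k), b^2)
        r = min(1, exp((h(X(k)) - h(Y))/T));  % f(Y)/f(X) as one exponential
        if rand < r
            X(k+1) = Y;                       % accept Y
        else
            X(k+1) = X(k);                    % reject Y
        end
        k = k + 1;
    end
end
% plot(X) shows the chain settling near x = 2
```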
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies detailed balance equation, similar to Metropolis-Hastings<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the one given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are strong correlations between the random variables, because the chain then moves through the distribution very slowly.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle <br />
<br />
x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
where each draw conditions on the most recent values of the other coordinates. After convergence, the resulting points <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follow the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y \mid x = X_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x \mid y = Y_{i+1})</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the burn-in time (referred to in these notes as the "burning time"). For this reason, the distribution is represented better by discarding the early <math>\displaystyle X_i </math> and keeping only the later ones.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution with mean <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and covariance matrix <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2]; %mean vector [mu_1 ; mu_2]<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2); %conditional standard deviation<br />
X(1,:) = [0 0]; %initial point<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2)); %mean of x_1 given the previous x_2<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); %mean of x_2 given the new x_1<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots roughly the last 4000 points -><br />
%this demonstrates that convergence occurs after a while<br />
%(this is called the burning time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
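<br />
The procedure above can be sketched in MATLAB as follows. The target density (an unnormalized bivariate normal with correlation 0.5), the random-walk proposals and the chain length are assumptions made for this illustration. Because both proposals are symmetric, the ratio of proposal densities in <math>\displaystyle r </math> equals 1 and is omitted.<br />
<br />
```matlab
% Metropolis-Hastings within Gibbs sketch (target and proposals are assumed)
rho = 0.5;
f = @(x,y) exp(-(x.^2 - 2*rho*x.*y + y.^2) / (2*(1 - rho^2)));  % unnormalized target
n = 20000;
X = zeros(n,1);  Y = zeros(n,1);     % start at (X_0, Y_0) = (0, 0)
for i = 1:n-1
    Z = X(i) + randn;                % symmetric proposal q for X, with Y held fixed
    r = min(1, f(Z, Y(i)) / f(X(i), Y(i)));
    if rand < r, X(i+1) = Z; else, X(i+1) = X(i); end
    Z = Y(i) + randn;                % symmetric proposal q-tilde for Y, using the new X
    r = min(1, f(X(i+1), Z) / f(X(i+1), Y(i)));
    if rand < r, Y(i+1) = Z; else, Y(i+1) = Y(i); end
end
```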
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many important pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero when <math>\displaystyle d < 1</math>, because every page receives the constant term <math>\displaystyle 1-d</math>. The coefficient <math>\displaystyle d </math> (a number between 0 and 1) weights the link-based sum against this constant.<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
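<br />
As a small worked example, the matrix form can be solved directly as a linear system. The four-page link structure and the value <math>\displaystyle d = 0.85</math> below are hypothetical choices for illustration; they do not come from the lecture.<br />
<br />
```matlab
% Four-page example (link structure is hypothetical)
% L(i,j) = 1 if page j links to page i
L = [0 0 1 1;
     1 0 0 0;
     1 1 0 1;
     0 0 0 0];
N = size(L,1);
d = 0.85;                    % assumed value of d
c = sum(L,1);                % c(j) = number of outgoing links from page j
D = diag(c);
e = ones(N,1);
% P = (1-d)e + d L D^{-1} P  is the linear system  (I - d L D^{-1}) P = (1-d)e
P = (eye(N) - d*L/D) \ ((1-d)*e);
```
<br />
Page 4 has no incoming links, so its rank is just the constant term <math>\displaystyle 1-d = 0.15</math>. Note that a page with no outgoing links would give <math>\displaystyle c_j = 0</math> and make D singular, so this sketch assumes every page links to at least one other page.<br />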
<br />
====Solving for P====<br />
Since rank is a relative term, we only need to fix the scale of P. Summing the defining equation over i (and using the fact that each column of <math>\displaystyle L \times D^{-1}</math> sums to 1 when every page has at least one outgoing link) shows that the equation is consistent with the normalization<br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. an average rank of 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). With this normalization we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector of the matrix A corresponding to eigenvalue 1. Secondly, we can recognize that P is, up to scaling, the stationary distribution of a Markov chain with transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> (scaled to satisfy the normalization) and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n we have <math>\displaystyle P_n \approx P</math>.<br />
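<br />
The iteration can be sketched as follows for a hypothetical four-page web. Assumptions made for this illustration: <math>\displaystyle d = 0.85</math>, ranks scaled so that they sum to N, and the constant term of A written as <math>\displaystyle (1-d) \times e \times e^T / N</math> so that each column of A sums to 1.<br />
<br />
```matlab
% Power iteration for P = A P on a hypothetical four-page web
L = [0 0 1 1; 1 0 0 0; 1 1 0 1; 0 0 0 0];   % L(i,j) = 1 if j links to i
N = size(L,1);
d = 0.85;                                   % assumed value of d
e = ones(N,1);
M = L / diag(sum(L,1));                     % L * D^{-1}; each column sums to 1
A = (1-d)*(e*e')/N + d*M;                   % constant term scaled so columns of A sum to 1
P = e;                                      % initial guess with sum(P) = N
for k = 1:100
    P = A*P;                                % P_k = A * P_{k-1}
end
```
<br />
After the loop, A*P and P agree to within rounding error, so P is (numerically) an eigenvector of A with eigenvalue 1.<br />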
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
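<br />
A quick numerical check of these definitions, using two arbitrarily chosen vectors in <math>\mathbf{R}^3</math> (the variable names are illustrative):<br />
<br />
```matlab
u = [1; 2; 2];
v = [2; 0; 1];
ip    = u' * v;                          % inner product: 1*2 + 2*0 + 2*1 = 4
len_u = sqrt(u' * u);                    % length of u: sqrt(1 + 4 + 4) = 3
dist  = norm(u - v);                     % distance: ||[-1; 2; 1]|| = sqrt(6)
theta = acos(ip / (norm(u)*norm(v)));    % angle between u and v, in radians
```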
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that the terms orthogonal matrix and orthonormal matrix refer to the same kind of matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{n \times n} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be computed as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity)<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the maximum number of linearly independent rows (or columns) of A.<br />
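<br />
These quantities are easy to check numerically. The two small matrices below are arbitrary examples chosen for illustration:<br />
<br />
```matlab
A = [2 1; 1 2];
t  = trace(A);        % 2 + 2 = 4
ev = eig(A);          % eigenvalues 1 and 3
s  = sum(ev);         % equals trace(A)
r  = rank(A);         % 2: the columns are linearly independent
B  = [1 2; 2 4];      % second column is twice the first
rB = rank(B);         % 1
```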
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent, it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math>, which satisfies <math>A^{\dagger}A = I</math>.<br />
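<br />
For instance, for a non-square matrix with linearly independent columns (the matrix below is an arbitrary example), the formula can be checked against MATLAB's built-in <code>pinv</code>:<br />
<br />
```matlab
A = [1 0; 0 1; 1 1];                % 3-by-2 with linearly independent columns
Adag = (A'*A) \ A';                 % (A^T A)^{-1} A^T
err_left = norm(Adag*A - eye(2));   % close to 0, so Adag * A = I
err_pinv = norm(Adag - pinv(A));    % close to 0, agrees with the built-in pinv
```
<br />
Note that the product in the other order, A*Adag, is a 3-by-3 matrix of rank 2 and therefore is not the identity.<br />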
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*For n vectors <math>\{u_i\}</math> in an n-dimensional space to form a basis, it is necessary and sufficient that they be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>; <math>\lambda</math> is the corresponding ''eigenvalue''.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
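<br />
These properties can be verified numerically with MATLAB's <code>eig</code>. The matrix below is an arbitrary real, symmetric, positive-definite example:<br />
<br />
```matlab
A = [2 1; 1 2];                     % real, symmetric, positive-definite
[V, D] = eig(A);                    % columns of V are eigenvectors, diag(D) the eigenvalues
lambda = diag(D);                   % the eigenvalues 1 and 3: real and positive
res  = norm(A*V - V*D);             % close to 0: A v = lambda v for each column
orth = norm(V'*V - eye(2));         % close to 0: the eigenvectors are orthonormal
cp   = det(A - lambda(1)*eye(2));   % close to 0: lambda solves |A - lambda I| = 0
```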
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^TA^TAx} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
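<br />
A 2-D rotation matrix gives a simple illustration of these properties. The angle and the vector below are arbitrary choices:<br />
<br />
```matlab
t = pi/6;                                   % rotation angle (arbitrary choice)
A = [cos(t) -sin(t); sin(t) cos(t)];        % 2-D rotation matrix
x = [3; 4];
y = A*x;
err_orth = norm(A*A' - eye(2));             % close to 0, so A^T = A^{-1}
len_x = norm(x);                            % 5
len_y = norm(y);                            % also 5, up to rounding
```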
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
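<br />
The decorrelating effect can be seen in a short simulation. This is a sketch under assumptions: the covariance matrix and the sample size are made up for the example, and the sample covariance is used in place of the true one.<br />
<br />
```matlab
n = 5000;
Sigma = [2 1.2; 1.2 1];            % assumed covariance matrix (positive-definite)
X = randn(n,2) * chol(Sigma);      % rows are samples with covariance close to Sigma
S = cov(X);                        % sample covariance
[V, D] = eig(S);                   % eigenvectors (columns of V) and eigenvalues
Z = X * V;                         % coordinates along the principal directions
SZ = cov(Z);                       % equals V' * S * V = D, up to rounding
```
<br />
The off-diagonal entries of SZ are zero up to rounding error, and its diagonal entries are the eigenvalues of S, i.e. the variances along the principal directions.<br />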
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
Given a high-dimensional sample of vectors, applying PCA produces an orthogonal set of vectors (called principal components) such that the first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
If we ignore the last few principal components (directions with the smallest variance) then we can approximate the data by a lower-dimensional subspace, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
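<br />
The decomposition described above can be sketched in MATLAB. Since the image data is not included here, the sketch uses a random 644 by 103 matrix as a stand-in, and it computes the principal directions with a singular value decomposition of the centred data, which is one standard way to obtain them (the lecture does not specify the method).<br />
<br />
```matlab
% Rank-2 PCA approximation of a 644-by-103 data matrix (random stand-in data)
X = rand(644, 103);                        % placeholder for the images, one per column
xbar = mean(X, 2);                         % mean image (the constant term)
Xc = X - repmat(xbar, 1, size(X,2));       % centre the data
[U, ~, ~] = svd(Xc, 'econ');               % columns of U are the principal directions
V = U(:, 1:2);                             % 644-by-2: first two principal components
W = V' * Xc;                               % 2-by-103: one coefficient pair per image
Xhat = repmat(xbar, 1, size(X,2)) + V*W;   % approximation xbar + lambda_1 v_1 + lambda_2 v_2
```
<br />
With real digit images in place of the random matrix, each column of W gives the two coefficients of one image.<br />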
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction], PCA is linear).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of PCA===<br />
We want to find the direction of maximum variation. So take a direction <math>w = [w_1, \ldots, w_D]^T</math> and a data point <math>x = [x_1, \ldots, x_D]^T </math>, then compute the length of the projection of the point in that direction.<br />
<br />
<math><br />
u = \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
</math><br />
<br />
Of course, the direction <math>\textbf{w}</math> is the same as <math>2\textbf{w}</math> or in general <math>c\textbf{w}</math>, and it doesn't matter which one we use. So without loss of generality, let the length of <math>\textbf{w}</math> be 1. Therefore <math>\textbf{w}^T \textbf{w} = 1</math> so the equation simplifies to just<br />
<br />
<math><br />
u = \textbf{w}^T \textbf{x}.<br />
</math><br />
<br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>, which is<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br />
where <math>\Sigma</math> is the covariance matrix. For a finite data set we can replace <math>\Sigma</math> by <math>s</math>, the sample covariance matrix. <br />
<br />
So <math>\displaystyle w^T sw </math> is the sample variance of the projection <math>\displaystyle u </math> formed with the weight vector <math>\displaystyle w </math>.<br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br />
Therefore the variance is bounded, so the maximum exists. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, but subject to a constraint. The problem then becomes:<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br />
<br />
Next lecture we will actually find the maximum.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that finding the direction of maximum variance amounts to solving: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a Lagrange Multiplier and form the Lagrangian L:<br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of L (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which the level curve of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel (equivalently, the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel).<br />
<br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial y})</math><br />
<br><br />
and <math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial x},\frac{\partial g}{\partial y})</math><br />
<br><br />
To incorporate these into one equation, we define L as <math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math>.<br />
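<br />
As a small worked example (not from the lecture, and chosen only for illustration), consider maximizing <math>\displaystyle f(x,y) = x + y</math> subject to <math>\displaystyle g(x,y) = x^2 + y^2 - 1 = 0</math>. The Lagrangian and its stationarity conditions are:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x + y - \lambda (x^2 + y^2 - 1)</math><br />
<br />
<math>\displaystyle \frac{\partial L}{\partial x} = 1 - 2\lambda x = 0, \quad \frac{\partial L}{\partial y} = 1 - 2\lambda y = 0, \quad \frac{\partial L}{\partial \lambda} = -(x^2 + y^2 - 1) = 0</math><br />
<br />
The first two conditions give <math>\displaystyle x = y = \frac{1}{2\lambda}</math>, and substituting into the constraint gives <math>\displaystyle \lambda = \pm \frac{1}{\sqrt{2}}</math>. Taking <math>\displaystyle \lambda = \frac{1}{\sqrt{2}}</math> yields the maximum at <math>\displaystyle x = y = \frac{1}{\sqrt{2}}</math> with <math>\displaystyle f = \sqrt{2}</math>; the other sign yields the minimum <math>\displaystyle f = -\sqrt{2}</math>.<br />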
<br />
<br><br />
Back to the original problem, from the Lagrangian we obtain <math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T s \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T s \textbf{w} </math> can be thought of as a quadratic function in '''w''', hence the '''2sw''' below)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2s\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br />
<math>\displaystyle s\textbf{w} = \lambda\textbf{w} </math><br />
<br><br />
This equation means that <math>\textbf{w}</math> is an eigenvector of s and <math>\lambda</math> is an eigenvalue of s.<br />
<br><br />
If we substitute <math>\displaystyle s\textbf{w} = \lambda\textbf{w}</math> into <math>\displaystyle \textbf{w}^T s\textbf{w}</math> we obtain <math>\displaystyle\textbf{w}^T s\textbf{w} = \textbf{w}^T \lambda \textbf{w} = \lambda \textbf{w}^T \textbf{w} = \lambda </math><br />
<br><br />
In order to maximize the objective function we need to choose the eigenvector with the largest eigenvalue.<br />
<br />
We choose the first PC, '''u1''', to have the maximum variance (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible).<br />
<br />
Subsequent principal components take up successively smaller parts of the total variability.<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
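<br />
As a minimal sketch of the result above (not code from the lecture), the following Matlab snippet computes the sample covariance matrix of a small synthetic data set and takes the eigenvector with the largest eigenvalue as the first principal component. The data, the variable names, and the choice of one data point per row are assumptions made only for illustration.<br />
<br />
 % Synthetic data: n points in D = 2 dimensions, one row per point (illustrative)<br />
 n = 500;<br />
 X = randn(n,2) * [2 0; 1 0.5];<br />
 s = cov(X);                     % sample covariance matrix (D x D)<br />
 [V, L] = eig(s);                % columns of V are eigenvectors, diag(L) the eigenvalues<br />
 [lambda, order] = sort(diag(L), 'descend');<br />
 w = V(:, order(1));             % first principal component, with w'*w = 1<br />
 u = X * w;                      % projections of the data onto w<br />
 disp([var(u), lambda(1)])       % the two numbers agree: var(u) = w'*s*w = lambda_1<br />
 disp([sum(lambda), trace(s)])   % the eigenvalues sum to the total variance<br />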
<br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
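 % Load the noisy face images; each column of X is one 20x28 image stored as a vector<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
 % Display the first noisy image<br />
 imagesc(reshape(X(:,1),20,28)')<br />
 colormap gray<br />
 % Singular value decomposition of the data matrix<br />
 [u, s, v] = svd(X);<br />
 % Rank-10 reconstruction using the top 10 singular vectors<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';<br />
 % Display a reconstructed (de-noised) image in a new figure<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)')<br />
 colormap gray<br />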
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
Main Contribution not complete<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (images etc). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
MATLAB CODE<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
1. The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
2. Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the <br />
inner product. <math>U^T *X </math> is a (d x n) matrix.<br />
3. When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the <br />
dimensions that have lower variance (e.g. noise)<br />
4. We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
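<br />
The notes above can be sketched in Matlab as follows. This is only an illustrative sketch under stated assumptions, not the code from the slide: the data are synthetic, the names D, n, d, Ud and Xhat are hypothetical, and the principal directions are obtained from the economy-size SVD of the centred data.<br />
<br />
 % X is D x n with one data point per column (synthetic here, for illustration)<br />
 D = 5; n = 200; d = 2;<br />
 X = randn(D,d) * randn(d,n) + 0.01*randn(D,n);<br />
 mu = mean(X, 2);                      % D x 1 mean of the data<br />
 Xc = X - mu * ones(1,n);              % note 1: subtract the mean<br />
 [U, S, V] = svd(Xc, 'econ');          % columns of U are the principal directions<br />
 Ud = U(:, 1:d);                       % keep the top d directions<br />
 Y = Ud' * Xc;                         % note 2: encoded data, a d x n matrix<br />
 Xhat = Ud * Y + mu * ones(1,n);       % note 3: reconstruction from d dimensions<br />
 disp(size(Y))<br />
 disp(norm(X - Xhat, 'fro') / norm(X, 'fro'))   % relative reconstruction error<br />
<br />
Because the synthetic data lie close to a d-dimensional subspace, the relative reconstruction error printed on the last line should be small.<br />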
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
Main Contribution Note complete<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variances. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes. For a 2 class problem, we want to reduce the data to one dimension (a line). Generally, for a k class problem, we want k-1 dimensions.<br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, while the points within each class are as close together as possible (i.e. each class collapses to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
<b> Goal </b><br />
<br />
<b>1. Minimize the within class variance</b><br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br />
<br />
<b>2. Maximize the distance between the means of the projected data</b><br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math></div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3227stat341 / CM 3612009-07-18T20:54:49Z<p>Hclam: /* Lagrange Multiplier */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges. A computer is deterministic, so in computational statistics we instead generate ''pseudo-random'' numbers: a deterministic sequence whose values appear to be uniformly distributed. Outside a computational setting, producing uniformly distributed random numbers is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way; those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
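<br />
Both examples can be reproduced with a short Matlab sketch of the recurrence. The variable names and the sequence length of 10 are arbitrary choices made for illustration.<br />
<br />
 % Linear congruential generator x_{i+1} = (a*x_i + b) mod m<br />
 a = 13; b = 0; m = 31;          % parameters of the first example<br />
 x = zeros(1,10);<br />
 x(1) = 1;                       % seed x_0 (Matlab indices start at 1)<br />
 for i = 1:9<br />
     x(i+1) = mod(a*x(i) + b, m);<br />
 end<br />
 disp(x(1:4))                    % 1 13 14 27, as computed above<br />
 a = 3; b = 2; m = 4;            % parameters of the second example<br />
 y = ones(1,10);                 % seed y_0 = 1<br />
 for i = 1:9<br />
     y(i+1) = mod(a*y(i) + b, m);<br />
 end<br />
 disp(y)                         % every term equals 1<br />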
<br />
Ideally, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for any ''n'', and the period to equal ''m'' - 1. In practice, Park and Miller found in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, strictly increasing, and is the cumulative distribution function (cdf) of some distribution, then the random variable X has cdf F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is monotonically non-decreasing, since the probability that X is at most a greater number must be at least the probability that X is at most a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
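:<math> F^{-1}(y) = \frac{-\log(1-y)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as <math>Unif[0, 1]</math> when <math>U \sim~ Unif[0, 1]</math>, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>.<br />
<br />
Now we can generate our random sample <math>x_1, \dots, x_n</math> from <math>f(x)</math> by, for <math>i=1\dots n</math>:<br />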
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows an exponential pattern, as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is too difficult (or impossible) to invert <math>F(x)</math> in closed form. Further, for some distributions <math>F(x)</math> itself has no closed form expression.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the upper bounds use less-than-or-equal-to signs rather than the strict less-than signs used in class, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000);           % preallocate the sample<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)             % u <= 0.3<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2))  % 0.3 < u <= 0.5<br />
     x(i)=1;<br />
   else                     % 0.5 < u <= 1<br />
     x(i)=2;<br />
   end<br />
 end<br />
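<br />
The same idea extends to any finite discrete distribution by comparing <math>u</math> against the cumulative sums of the probabilities. The following is a sketch under the assumption that the values and probabilities are stored in two vectors; the names vals, p and F are illustrative only.<br />
<br />
 % Inverse transform sampling for a general discrete distribution<br />
 vals = [0 1 2];                 % values x_0, ..., x_n<br />
 p    = [0.3 0.2 0.5];           % probabilities p_0, ..., p_n<br />
 F    = cumsum(p);               % cdf evaluated at each value<br />
 F(end) = 1;                     % guard against rounding in the last partial sum<br />
 N = 10000;<br />
 x = zeros(1,N);<br />
 for i = 1:N<br />
     u = rand;<br />
     k = find(u <= F, 1);        % first index with u <= p_0 + ... + p_k<br />
     x(i) = vals(k);<br />
 end<br />
 disp([mean(x==0), mean(x==1), mean(x==2)])   % should be close to p<br />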
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c</math> such that <math>c \cdot g(x) \geq f(x)</math> for all <math>x</math>, we can draw candidates from <math>g(x)</math> and accept each candidate <math>x</math> with probability <math> \frac {f(x)}{c \cdot g(x)} </math>; the accepted candidates form a sample that follows the target distribution <math>f(x)</math>. A candidate is likely to be accepted where this ratio is close to 1 and likely to be rejected where it is close to 0.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, sampling from <math> c \cdot g(x)</math> will yield a sample that follows the target distribution <math>f(x)</math>.<br />
<br />
At x=9, we will reject samples according to the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> after sampling from <math> c \cdot g(x)</math>.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. Choosing <math>c = \max \left(\frac{f(x)}{g(x)}\right)</math> gives the smallest constant that satisfies <math> c \cdot g(x) \geq f(x) </math> for all <math>x</math>, and therefore the highest acceptance probability <math>\frac{1}{c}</math>.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
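 n = 1000;<br />
 x = zeros(1,n);       % preallocate the accepted samples<br />
 j = 0;                % number of samples accepted so far<br />
 while j < n<br />
   y = rand;           % candidate from the proposal Unif[0,1]<br />
   u = rand;<br />
   if u <= y           % accept with probability f(y)/(c*g(y)) = y<br />
     j = j + 1;<br />
     x(j) = y;<br />
   end<br />
 end<br />
 hist(x);<br />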
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
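<br />
A minimal Matlab sketch of this discrete example follows. It is not code from the lecture; the sample size N and the variable names are assumptions made for illustration, and the candidate is drawn with ceil(4*rand), which is uniform on {1,2,3,4}.<br />
<br />
 % Acceptance/rejection for the pmf above with a discrete uniform proposal<br />
 f = [0.15 0.55 0.20 0.10];      % target pmf on {1,2,3,4}<br />
 g = 0.25;                       % proposal pmf: uniform on {1,2,3,4}<br />
 c = max(f) / g;                 % c = 2.2<br />
 N = 10000;<br />
 x = zeros(1,N);<br />
 j = 0;<br />
 while j < N<br />
     y = ceil(4*rand);           % candidate Y from the proposal<br />
     u = rand;<br />
     if u <= f(y) / (c*g)        % accept Y with probability f(Y)/(c*g(Y))<br />
         j = j + 1;<br />
         x(j) = y;<br />
     end<br />
 end<br />
 disp([mean(x==1), mean(x==2), mean(x==3), mean(x==4)])   % should be close to f<br />
<br />
On average only a fraction 1/c of the candidates is accepted, so the loop draws roughly 2.2 candidates per accepted sample.<br />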
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math>.<br />
<br />
It appears that if we want to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
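<br />
For a general rate <math>\lambda</math>, the product form in the last step of the procedure can be used directly. The sketch below is illustrative only: the values of t, lambda and N are arbitrary, and the comparison with <math>t/\lambda</math> relies on the mean of a <math>Gamma(t, \lambda)</math> distribution with rate <math>\lambda</math>.<br />
<br />
 t = 5; lambda = 2; N = 10000;   % illustrative parameter values<br />
 U = rand(N, t);                 % t independent uniforms per sample<br />
 X = -log(prod(U, 2)) / lambda;  % X = -(1/lambda) * log(u_1 * ... * u_t)<br />
 disp([mean(X), t/lambda])       % sample mean vs. theoretical mean t/lambda<br />
 hist(X)<br />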
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far, eg. the inverse is not easy to obtain from F(Z); we may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is its distance from the origin (the pole), and θ is the angle formed with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}} \cdot \frac{1}{2\pi} \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so <math>d</math> and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independently} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: Sample from a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (called u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
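<br />
The claim that both factors give the same distribution can be checked numerically. The following is a minimal sketch (the sample size and variable names are chosen here for illustration): chol returns an upper triangular R with R'*R = sigma, sqrtm returns a symmetric S with S*S = sigma, and in both cases each row of Z times the factor has covariance sigma.<br />

```matlab
% Sketch: check that both "square roots" reproduce the target covariance.
mu    = [-2; 3];
sigma = [1 0.5; 0.5 1];
N     = 200000;                  % sample size chosen for illustration

R = chol(sigma);                 % upper triangular, R'*R = sigma
S = sqrtm(sigma);                % symmetric, S*S = sigma

Z  = randn(N, 2);                % each row is an independent N(0, I) draw
X1 = ones(N,1)*mu' + Z*R;        % row covariance R'*R = sigma
X2 = ones(N,1)*mu' + Z*S;        % row covariance S'*S = sigma

errR = norm(R'*R - sigma);       % ~0, floating point error only
errS = norm(S*S  - sigma);       % ~0, floating point error only
cov1 = cov(X1)                   % sample covariance, close to sigma
cov2 = cov(X2)                   % sample covariance, close to sigma
m1   = mean(X1)                  % sample mean, close to mu'
```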
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right]</math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \hat{\mu} = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i = \bar{x}</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
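Assuming the prior on <math>\theta = \mu</math> is Gaussian with mean <math>\mu_0</math> and standard deviation <math>\tau</math> (with <math>\sigma</math> treated as known):<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} \cdot \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0, \quad \tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean <math>\tilde{\mu}</math> is a weighted average of the sample mean <math>\bar{x}</math> (the MLE) and the prior mean <math>\mu_0</math>, so in general it differs from the MLE.<br />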
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
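<br />
The two limits can be seen numerically. The sketch below uses made-up values for <math>\sigma</math>, <math>\mu_0</math>, <math>\tau</math> and a true mean used only to simulate data; it computes <math>\tilde{\mu}</math> from the formula above for increasing N.<br />

```matlab
% Sketch: posterior mean of mu for a Gaussian likelihood with known sigma
% and a Gaussian prior N(mu0, tau^2). All numbers below are made up.
sigma  = 2;     % known data standard deviation (assumed)
mu0    = 0;     % prior mean (assumed)
tau    = 1;     % prior standard deviation (assumed)
muTrue = 3;     % used only to simulate data

Ns = [0 1 10 100 10000];
postMean = zeros(size(Ns));
for k = 1:numel(Ns)
    N = Ns(k);
    x = muTrue + sigma*randn(N,1);
    if N == 0
        xbar = 0;                 % no data: xbar gets weight 0 anyway
    else
        xbar = mean(x);
    end
    wData  = (N/sigma^2) / (N/sigma^2 + 1/tau^2);   % weight on xbar
    wPrior = (1/tau^2)   / (N/sigma^2 + 1/tau^2);   % weight on mu0
    postMean(k) = wData*xbar + wPrior*mu0;
end
disp([Ns' postMean'])             % column 1: N, column 2: posterior mean
```

With N = 0 the result is exactly the prior mean; as N grows it approaches the sample mean.<br />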
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim Unif(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
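<br />
The process above, including the standard error and confidence interval, can be sketched in Matlab as follows. The integrand, limits and sample size are illustrative choices rather than values from the lecture.<br />

```matlab
% Sketch: basic Monte Carlo integration of h over [a,b] with a standard
% error and an approximate 95% confidence interval.
h = @(x) x.^3;              % integrand (true integral over [0,1] is 1/4)
a = 0;  b = 1;
N = 100000;

x    = a + (b-a)*rand(N,1); % x_i ~ Unif(a,b)
Y    = h(x)*(b-a);          % Y_i = w(x_i)
Ihat = mean(Y);             % Monte Carlo estimate of I
SE   = std(Y)/sqrt(N);      % std uses the N-1 denominator by default
z    = 1.96;                % Z_{alpha/2} for alpha = 0.05
CI   = [Ihat - z*SE, Ihat + z*SE]
```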
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
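# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, the Exponential(1) density, which can be done by the inverse transform <math>x = -\log(u)</math> with <math>u \sim~ Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math> with <math>w(x) = \sqrt{x}</math><br />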
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u=rand(100,1);<br />
x=-log(u);      %inverse transform: x ~ Exp(1)<br />
h=x.^.5;        %w(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value of the integral using Matlab's numerical integration (the exact value is sqrt(pi)/2, about 0.8862)<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math> (the exact value, for comparison)<br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);   %1000 draws from Unif(0,1)<br />
mean(x.^3)          %elementwise cube, then average<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite range<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dx=-\frac{dy}{y^2} </math>, which maps <math>x=5</math> to <math>y=1</math> and <math>x\to\infty</math> to <math>y\to 0</math><br />
# <math> I = \displaystyle\int^0_1 3e^{-(\frac{1}{y}+4)}\left(-\frac{1}{y^2}\right)\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
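<br />
As a sketch, Example 3 can be run in Matlab and compared with the closed form <math>\int_5^\infty 3e^{-x}\,dx = 3e^{-5}</math>. The sample size is an arbitrary choice.<br />

```matlab
% Sketch: Monte Carlo estimate of the integral of 3*exp(-x) over [5, inf)
% after the substitution y = 1/(x-4), which maps the range to (0,1].
N = 100000;
y = rand(N,1);                      % Y_i ~ Unif(0,1)
w = 3*exp(-(1./y + 4)) ./ y.^2;     % integrand in terms of y
Ihat  = mean(w)                     % Monte Carlo estimate
exact = 3*exp(-5)                   % closed-form value for comparison
```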
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial random variables with success probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(m, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be a flat distribution, and the posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior density of <math>\displaystyle p_1</math> given <math>\displaystyle X = x</math> is proportional to the Binomial likelihood <math>\displaystyle p_1^x(1-p_1)^{n-x}</math>, which is the kernel of a Beta distribution, and similarly for <math>\displaystyle p_2</math>. Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, m-y+1)</math>, where the Beta density is<br><br />
:<math>\displaystyle f(p;\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\ \text{CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>, where <math>\displaystyle p_i</math> is the probability of death at dose <math>\displaystyle x_i</math>.<br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is the function defined above, which has no convenient closed form<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We assume a flat prior on each <math>\displaystyle p_i</math>, so that each posterior is a Beta distribution, and proceed as follows:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math><br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard and try again.<br />
# Compute <math>\displaystyle \delta = x_j</math>, where j is the first index with <math>\displaystyle p_j \geq 0.5</math>.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die, <math>Y_i</math>, is observed, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
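<br />
The process can be sketched in Matlab as below. The dose values and death counts are hypothetical, since the notes do not list the full data set, and the extra check that the largest <math>p_i</math> is at least 0.5 is an assumption added here so that <math>\delta</math> is always defined. As in the earlier example, betarnd requires the Statistics Toolbox.<br />

```matlab
% Sketch of the dose-response simulation with hypothetical data.
x = 1:10;                        % dose levels x_1 < ... < x_10 (assumed)
y = [0 1 2 3 4 6 7 8 9 10];      % deaths out of n rats per dose (assumed)
n = 10;                          % rats per dose
N = 1000;                        % number of accepted samples wanted

delta = zeros(N,1);
kept  = 0;
tries = 0;
while kept < N && tries < 1e7    % cap on tries guards against a long loop
    tries = tries + 1;
    p = betarnd(y+1, n-y+1);     % one draw of (p_1,...,p_10)
    % keep only ordered draws; also require p_10 >= 0.5 (added assumption)
    if all(diff(p) >= 0) && p(end) >= 0.5
        j = find(p >= 0.5, 1);   % first dose with P(death) >= 0.5
        kept = kept + 1;
        delta(kept) = x(j);
    end
end
delta      = delta(1:kept);
deltaBar   = mean(delta)         % Monte Carlo estimate of delta
acceptRate = kept/tries          % fraction of draws that were kept
```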
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the given random variables. The main difference is that importance sampling recognizes that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others, and weights the samples accordingly. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the distribution we sample from and g(x) is the function to be integrated. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, such that <math>\int_{0}^1 g(x)\,dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x) and therefore <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because each value sampled from <math>\displaystyle g(x)</math> is given an 'importance' or 'weighting' <math>\displaystyle \frac{f(x)}{g(x)}</math>: values that are more likely under the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), than under <math>\displaystyle g(x)</math> receive a higher weight.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, but weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that points where this ratio is high count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of the <math> h(x_i) </math> instead of a plain average.<br />
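<br />
A minimal Matlab sketch of the weighted average is given below. The target f, proposal g and function h are illustrative choices (not from the lecture) for which the true value is known to be I = 1.<br />

```matlab
% Sketch: importance sampling estimate of I = E_f[h(X)] with
%   f(x) = exp(-x)          target density, Exp with rate 1
%   g(x) = 0.5*exp(-x/2)    proposal density, Exp with rate 1/2
%   h(x) = x                so the true value is I = 1
N = 100000;
x = -2*log(rand(N,1));      % inverse transform: x_i ~ g
f = exp(-x);
g = 0.5*exp(-x/2);
B = f./g;                   % weights B(x_i) = f(x_i)/g(x_i)
w = B.*x;                   % w(x_i) = B(x_i)*h(x_i)
Ihat = mean(w)              % importance sampling estimate
SE   = std(w)/sqrt(N)       % estimated standard error
```

The proposal here has a thicker tail than the target, so the weights stay bounded.<br />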
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math> and <math>\displaystyle x_i \sim~ g</math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> where <math>\displaystyle h^2(x)f^2(x)</math> does not, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, in which case <math>\frac{h^2(x)f^2(x)}{g(x)} </math> could be infinitely large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
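<br />
As an illustration (an example constructed here, not taken from the lecture), take <math>h(x) = 1</math>, target <math>f(x) = e^{-x}</math> and proposal <math>g(x) = 3e^{-3x}</math> on <math>x \geq 0</math>, so that g has the thinner tail:<br />
:<math>E_g[(w(x))^2] = \int_0^\infty \frac{e^{-2x}}{3e^{-3x}}\,dx = \frac{1}{3}\int_0^\infty e^{x}\,dx = \infty</math><br />
Swapping the roles, with target <math>f(x) = 3e^{-3x}</math> and the thicker-tailed proposal <math>g(x) = e^{-x}</math>, the second moment is finite:<br />
:<math>E_g[(w(x))^2] = \int_0^\infty \frac{9e^{-6x}}{e^{-x}}\,dx = 9\int_0^\infty e^{-5x}\,dx = \frac{9}{5}</math><br />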
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
======Aside: Jensen's Inequality======<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>; here <math>\displaystyle g</math> denotes a generic function, not the proposal density) then for <math>\displaystyle 0 \leq \alpha \leq 1</math>, <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
Essentially the definition of convexity says that the line segment between two points on a curve lies on or above the curve, which can then be generalized to any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
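Therefore, applying this with the convex function <math>\displaystyle t \mapsto t^2</math> to the random variable <math>\displaystyle \left| w(x) \right|</math>,<br />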
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
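Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle w(x)</math>, we get <math>\displaystyle \left| w(x) \right| = \int \left| h(s) \right| f(s) ds</math>, a constant, so <math>\displaystyle E[(w(x))^2] = \left(\int \left| h(s) \right| f(s) ds \right)^2</math>, which attains the lower bound as required.<br /><br />
<br />
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is an integral of the same kind as the one we are trying to estimate.<br />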
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
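<br />
The theorem can be illustrated with a case where <math>g^*</math> happens to be available. In the sketch below (an example chosen here for illustration), <math>h(x) = x</math> and <math>f(x) = e^{-x}</math> on <math>x \geq 0</math>, so <math>g^*(x) = xe^{-x}</math> is a Gamma(2,1) density, which can be sampled as the sum of two independent Exp(1) draws.<br />

```matlab
% Sketch: with the optimal proposal g*, every weight w(x_i) equals the
% same constant (here 1), so the estimator has zero variance.
N = 10000;
x = -log(rand(N,1)) - log(rand(N,1));   % sum of two Exp(1) draws ~ Gamma(2,1)
w = (x.*exp(-x)) ./ (x.*exp(-x));       % w = h*f/g*, identically 1
varOpt = var(w)                         % zero

xf = -log(rand(N,1));                   % plain Monte Carlo: x_i ~ f
varMC = var(xf)                         % close to Var_f(X) = 1
```

Knowing that <math>g^*</math> integrates to 1 here already required knowing that <math>\int |h(x)|f(x)\,dx = 1</math>, which is the answer being sought.<br />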
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since <math>\displaystyle \int f(x)\,dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
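<br />
The reason this form works with an unnormalized f is that any constant factor in <math>b(x_i)</math> cancels between numerator and denominator. A minimal Matlab sketch, with an illustrative target, proposal and h chosen here so that the true value is 1:<br />

```matlab
% Sketch: self-normalised importance sampling when f is known only up to
% a constant. fU is an unnormalised N(0,1) density, the proposal is
% N(0, 2^2), and h(x) = x^2, so the true value is E_f[X^2] = 1.
N  = 100000;
x  = 2*randn(N,1);          % x_i ~ g = N(0,4)
fU = exp(-x.^2/2);          % target without its 1/sqrt(2*pi) factor
gU = exp(-x.^2/8);          % proposal without its 1/(2*sqrt(2*pi)) factor
b  = fU./gU;                % weights, correct up to a constant
bn = b/sum(b);              % normalised weights b'(x_i), summing to 1
Ihat = sum(bn.*x.^2)        % self-normalised estimate
```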
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_3 h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to compute the proportion of observations that are at least 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is a low probability event (a tail event), and different runs may give significantly different values. For example, with 1000 samples per run, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we repeat the estimate 100 times, generating 100 samples each time):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 \times 10^{-5} </math>.<br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is the standard normal density and <math>g(x)</math> needs to be chosen wisely so that it is similar in shape to <math>h(x)f(x)</math>, i.e. it puts most of its mass where <math>x \geq 3</math>.<br />
<br />
:Let <math>g(x)</math> be the density of <math>N(4,1) </math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and much less than the variability observed when using Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math> and <math>\displaystyle Past_1</math> is empty<br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math>\displaystyle X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided by a coin flip. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
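<br />
As a minimal sketch (in Python rather than Matlab, with the hypothetical helper name <code>random_walk_matrix</code>), the matrix of such a random walk can be built as follows. It assumes the convention stated in the text: a step to the right has probability p, a step to the left has probability q = 1-p, and both end states are absorbing.<br />

```python
def random_walk_matrix(num_states, p):
    """Random walk on states 0..num_states-1 with absorbing end states."""
    q = 1.0 - p
    P = [[0.0] * num_states for _ in range(num_states)]
    P[0][0] = 1.0                              # first state is absorbing
    P[num_states - 1][num_states - 1] = 1.0    # last state is absorbing
    for i in range(1, num_states - 1):
        P[i][i + 1] = p                        # step to the right
        P[i][i - 1] = q                        # step to the left
    return P


P = random_walk_matrix(5, 0.3)
for row in P:
    print(row)

# Each row is a pmf over the next state, so every row should sum to 1.
row_sums = [sum(row) for row in P]
```

The choice of 5 states and p = 0.3 is only for illustration.<br />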
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general concept for the application of this lies in having a set of generated <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a variation of importance estimation can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1: row <math>i</math> is itself a pmf over the next state, since a transition from <math>i</math> must lead to a state still within the state space of <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}Pr(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}Pr(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{Pr(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
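<br />
The theorem can be checked numerically. The following is a small sketch in plain Python; the two-state matrix, the starting distribution and the helper names <code>vec_mat</code>, <code>mat_mul</code> and <code>mat_pow</code> are made up for illustration. It computes <math>\mu_3</math> once by applying <math>P</math> three times and once as <math>\mu_0P^3</math>.<br />

```python
def vec_mat(mu, P):
    """Row vector times matrix: returns mu P."""
    return [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P[0]))]


def mat_mul(A, B):
    """Plain matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_pow(P, n):
    """P^n by repeated multiplication (n >= 0)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


P = [[0.9, 0.1],
     [0.4, 0.6]]
mu0 = [1.0, 0.0]          # start in the first state with probability 1

# mu_3 via three one-step updates mu_n = mu_{n-1} P
mu_step = mu0
for _ in range(3):
    mu_step = vec_mat(mu_step, P)

# mu_3 via the 3-step transition matrix P(3) = P^3
mu_power = vec_mat(mu0, mat_pow(P, 3))
print(mu_step, mu_power)
```

Both routes should give the same vector, here approximately (0.825, 0.175).<br />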
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all its states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that rows 1 and 2 have nonzero entries only in columns 1 and 2, so states 1 and 2 can only stay put or move to each other<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> The state 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
::<br />
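<br />
The decomposition above can be reproduced mechanically. The sketch below (plain Python, hypothetical function names, states relabelled 0 to 3 instead of 1 to 4) finds, for every state, the set of states it can reach, and groups together states that reach each other. It also tests whether a class is closed, i.e. whether no probability leaves it.<br />

```python
def reachable(P, start):
    """States that can be reached from `start` in zero or more steps."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, pij in enumerate(P[i]):
            if pij > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def communicating_classes(P):
    """Partition the states into classes of mutually reachable states."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        cls = sorted(j for j in reach[i] if i in reach[j])
        classes.append(cls)
        assigned.update(cls)
    return classes


def is_closed(P, cls):
    """A class is closed if no transition leads out of it."""
    return all(P[i][j] == 0 for i in cls for j in range(len(P)) if j not in cls)


P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]

classes = communicating_classes(P)
closed_flags = [is_closed(P, cls) for cls in classes]
for cls, closed in zip(classes, closed_flags):
    print([s + 1 for s in cls], "closed" if closed else "not closed")
```

With the 1-based labels of the example this reports the classes {1, 2} (closed), {3} (not closed) and {4} (closed).<br />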
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that all states of a finite irreducible chain are recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. The chain can be back in <math>\displaystyle 0</math> only after an even number of steps <math>\displaystyle 2n</math>, and only if exactly <math>\displaystyle n</math> of those steps went to the right and <math>\displaystyle n</math> to the left:<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know whether this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use ''Stirling's formula'':<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle P_{00}(n)=0</math> for odd <math>\displaystyle n</math>, the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{n}P_{00}(2n)\approx\sum_{n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
This series converges if <math>\displaystyle 4pq < 1</math> and diverges if <math>\displaystyle 4pq = 1</math>, so the state is transient if <math>\displaystyle 4pq < 1</math> and recurrent otherwise. Because <math>\displaystyle 4pq = 1</math> only when <math>p = \frac{1}{2}</math>:<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n) \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
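<br />
A quick numerical check of this conclusion, as a sketch in plain Python (the function name and the values p = 0.3 and p = 0.5 are chosen only for illustration): the partial sums of <math>P_{00}(2n)=\binom{2n}{n}(pq)^n</math> settle at <math>1/\sqrt{1-4pq}</math> when <math>p\neq\frac{1}{2}</math>, and keep growing when <math>p=\frac{1}{2}</math>.<br />

```python
import math


def return_prob_partial_sum(p, num_terms):
    """Sum of P_00(2n) = C(2n, n) (pq)^n for n = 0, ..., num_terms - 1."""
    pq = p * (1.0 - p)
    term = 1.0            # the n = 0 term
    total = 0.0
    for n in range(num_terms):
        total += term
        # C(2n+2, n+1) / C(2n, n) = 2 (2n + 1) / (n + 1)
        term *= 2.0 * (2 * n + 1) / (n + 1) * pq
    return total


# Transient case: the series converges to the closed form.
p = 0.3
closed_form = 1.0 / math.sqrt(1.0 - 4.0 * p * (1.0 - p))
transient_sum = return_prob_partial_sum(p, 2000)
print(transient_sum, closed_form)

# Recurrent case: the partial sums grow without bound (roughly like sqrt(N)).
sum_1000 = return_prob_partial_sum(0.5, 1000)
sum_10000 = return_prob_partial_sum(0.5, 10000)
print(sum_1000, sum_10000)
```

The terms are updated through the ratio of consecutive binomial coefficients, which avoids computing very large factorials.<br />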
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from the state i to the state j. It is also possible that the state j is not reachable from the state i.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n\geq 1: x_{n}=j \mid x_{0}=i\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
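A state <math>\displaystyle i</math> has period <math>\displaystyle d</math> if, starting from <math>\displaystyle i</math>, the chain can return to <math>\displaystyle i</math> only at multiples of <math>\displaystyle d</math> steps, i.e. <math>\displaystyle d=\gcd\{n\geq 1: P_{ii}(n)>0\}</math>. A Markov chain is called <math>\emph{periodic}</math> if its states have period <math>\displaystyle d>1</math>.<br />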
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition. You can only get back to state 1 after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
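<br />
The two examples can be checked with a short sketch in plain Python. It assumes the usual definition of the period of a state as the greatest common divisor of all step counts n with <math>P_{ii}(n)>0</math>, and it only inspects the first <code>max_steps</code> powers of <math>P</math>, which is enough for small chains like these. The function names are hypothetical.<br />

```python
from math import gcd


def mat_mul(A, B):
    """Plain matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def period(P, state, max_steps=50):
    """gcd of all n <= max_steps with a positive n-step return probability."""
    d = 0
    Pn = [row[:] for row in P]          # Pn holds P^n, starting with n = 1
    for n in range(1, max_steps + 1):
        if Pn[state][state] > 0:
            d = gcd(d, n)
        Pn = mat_mul(Pn, P)
    return d


three_cycle = [[0, 1, 0],
               [0, 0, 1],
               [1, 0, 0]]

four_state = [[0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0],
              [0, 0.5, 0, 0.5],
              [0.5, 0, 0.5, 0]]

lazy = [[0.5, 0.5],
        [0.5, 0.5]]                    # hypothetical aperiodic chain for contrast

print(period(three_cycle, 0), period(four_state, 0), period(lazy, 0))
```

The first chain has period 3 and the second has period 2; the third chain can return in a single step, so its period is 1 (aperiodic).<br />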
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \; \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br />
We need to show <math>\pi = \pi P</math>:<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5, 0, 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n)= E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
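<br />
The result can be verified numerically. The sketch below (plain Python, hypothetical helper name <code>step</code>) repeatedly applies <math>\mu \leftarrow \mu P</math> starting from an arbitrary distribution and compares the outcome with <math>(4/11, 4/11, 3/11)</math>. For contrast, it also applies the same update to the periodic three-state chain from the warning above, whose marginal keeps cycling instead of converging.<br />

```python
def step(mu, P):
    """One update of the marginal distribution: returns mu P."""
    return [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P[0]))]


P = [[1/2, 1/2, 0],
     [1/2, 1/4, 1/4],
     [0, 1/3, 2/3]]

mu = [1.0, 0.0, 0.0]                 # arbitrary starting distribution
for _ in range(200):
    mu = step(mu, P)
print(mu)                            # close to [4/11, 4/11, 3/11]

# The periodic chain has stationary distribution (1/3, 1/3, 1/3),
# but started from a single state its marginal cycles forever.
cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
nu0 = [1.0, 0.0, 0.0]
nu1 = step(nu0, cycle)
nu3 = step(step(nu1, cycle), cycle)
print(nu1, nu3)
```

Running 200 updates is an arbitrary choice that is more than enough for this small chain.<br />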
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>: this is the starting point of the chain; choose it randomly and set the index <math>\displaystyle i=0</math><br />
:<br />2. Draw <math>Y \sim q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=\min\left\{\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})},1\right\}</math><br />
:<br />4. Draw <math> U \sim Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>\displaystyle x</math> (initially the randomly chosen starting point <math>\displaystyle X_0</math>).<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=\min\left\{\frac{f(y)}{f(x)},1\right\}</math>. This special case is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953).<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The probability of setting the mean to x and sampling y is equal to the probability of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=\min\left\{\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})},1\right\}</math><br />
::<math>=\min\left\{\frac{f(y)}{f(x)},1\right\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=\min\left\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1\right\}</math><br />
::<math>=\min\left\{\frac{1+x^2 }{1+y^2},1\right\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
 b = 2;                  % proposal standard deviation; we will see what happens when b is smaller or larger<br />
 N = 10000;              % number of samples to generate<br />
 X = zeros(1, N);        % preallocate the chain<br />
 X(1) = randn;           % random starting point<br />
 for i = 2:N<br />
     Y = b*randn + X(i-1);                 % propose Y ~ N(X(i-1), b^2)<br />
     r = min((1 + X(i-1)^2)/(1 + Y^2), 1); % acceptance probability for the Cauchy target<br />
     u = rand;<br />
     if u < r<br />
         X(i) = Y;                         % accept Y<br />
     else<br />
         X(i) = X(i-1);                    % reject Y and remain in the current state<br />
     end<br />
 end<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle y</math> typically lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=\min\left\{\frac{f(y)}{f(x)},1\right\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution within the simulated run.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above, and we can then say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
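<br />
The effect of b can also be seen from the acceptance rate, without looking at histograms. The following is a sketch of the same Cauchy example in plain Python rather than Matlab; the function names, the seed and the run length of 5000 are arbitrary choices made for illustration.<br />

```python
import math
import random


def cauchy_density(x):
    """Target density f(x) = 1 / (pi (1 + x^2))."""
    return 1.0 / (math.pi * (1.0 + x * x))


def metropolis_cauchy(b, num_samples, seed=0):
    """Metropolis sampler with proposal N(x, b^2); returns (samples, acceptance rate)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)              # random starting point
    samples = [x]
    accepted = 0
    for _ in range(num_samples - 1):
        y = rng.gauss(x, b)              # propose Y ~ N(x, b^2)
        r = min(cauchy_density(y) / cauchy_density(x), 1.0)
        if rng.random() < r:
            x = y                        # accept Y
            accepted += 1
        samples.append(x)                # on rejection the current state is repeated
    return samples, accepted / (num_samples - 1)


samples_good, rate_good = metropolis_cauchy(2.0, 5000)
_, rate_large = metropolis_cauchy(1000.0, 5000)
_, rate_small = metropolis_cauchy(0.001, 5000)
print(rate_small, rate_good, rate_large)
```

With b=1000 almost every proposal is rejected, while with b=0.001 almost every proposal is accepted but each move is tiny; both extremes correspond to a chain that mixes poorly.<br />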
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br /> We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
That is, we want to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> by detailed balance<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case in which it equals 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})\cdot r(x,y)=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br /><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})\cdot r(y,x)=q(x\mid{y})\cdot 1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br /><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br /><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired, which shows that Metropolis-Hastings satisfies detailed balance. <br />
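<br />
The same argument can be checked numerically on a small discrete state space. The sketch below is in plain Python; the target <code>f</code>, the non-symmetric proposal matrix <code>q</code> and the function name are all made up for illustration. It builds the Metropolis-Hastings transition matrix with off-diagonal entries <math>P(x,y)=q(y\mid x)\,r(x,y)</math>, puts the rejected probability mass on the diagonal, and tests detailed balance and stationarity.<br />

```python
def mh_transition_matrix(f, q):
    """Transition matrix of Metropolis-Hastings for target f and proposal q[x][y] = q(y|x)."""
    n = len(f)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x != y and q[x][y] > 0:
                r = min(f[y] * q[y][x] / (f[x] * q[x][y]), 1.0)   # acceptance probability
                P[x][y] = q[x][y] * r
        P[x][x] = 1.0 - sum(P[x])     # staying put: proposing x itself or being rejected
    return P


f = [0.1, 0.2, 0.3, 0.4]              # target distribution
q = [[0.10, 0.50, 0.20, 0.20],        # each row is a proposal pmf
     [0.30, 0.10, 0.40, 0.20],
     [0.25, 0.25, 0.25, 0.25],
     [0.60, 0.20, 0.10, 0.10]]

P = mh_transition_matrix(f, q)

# Largest violation of f(x) P(x,y) = f(y) P(y,x) over all pairs of states.
balance_gap = max(abs(f[x] * P[x][y] - f[y] * P[y][x])
                  for x in range(4) for y in range(4))

# f P should reproduce f if f is stationary.
fP = [sum(f[x] * P[x][y] for x in range(4)) for y in range(4)]
print(balance_gap, fP)
```

The gap is zero up to rounding error, and <code>fP</code> reproduces <code>f</code>, in line with the theorem that detailed balance implies stationarity.<br />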
<br />
<br />
Next lecture we will see that the Metropolis-Hastings chain is also irreducible and aperiodic, and therefore that it converges to its stationary distribution.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), whose algorithm is also the basis of the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although it is less widely applicable.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br /><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed).<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim g </math>; <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br />2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of the distance between the current state and the state the chain is going to travel to, i.e. it is of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br />3. <math>r(x,y)=\min\left\{\frac{f(y)}{f(x)},1\right\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In Matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant (the form of the distribution is known, but the distribution is not exactly known).<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because the acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math>.<br />
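<br />
The dependence can be seen in a small simulation. The sketch below is only an illustration under assumed choices: the target is a standard normal known up to a constant, and the fixed proposal is a standard Cauchy (drawn by inverting its CDF). Neither choice comes from the lecture.<br />
<br />
```matlab
% Independence Metropolis-Hastings: the proposal ignores the current state,
% but the acceptance probability r does not.
f = @(x) exp(-x.^2/2);             % target, known up to a constant (assumed: standard normal)
q = @(x) 1 ./ (pi*(1 + x.^2));     % fixed proposal density (assumed: standard Cauchy)
n = 10000;
X = zeros(n,1);                    % X(1) = 0 is the starting state
for i = 2:n
    Y = tan(pi*(rand - 0.5));      % inverse-CDF draw from the Cauchy; does not use X(i-1)
    r = min(1, (f(Y)/f(X(i-1))) * (q(X(i-1))/q(Y)));
    if rand < r
        X(i) = Y;                  % accept Y
    else
        X(i) = X(i-1);             % reject Y: the current state is repeated
    end
end
repeats = mean(X(2:end) == X(1:end-1));   % fraction of steps that repeat the previous value
```
<br />
Every rejection repeats the previous state, so consecutive values are tied together even though each proposal was drawn independently.<br />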
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But, this is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for some constant <math>\displaystyle T > 0</math> (since the exponential function is monotonically increasing)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
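:#Decrease T, set <math>\displaystyle i = i + 1</math>, and go to Step 3 (the chain continues from the current state rather than restarting)<br />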
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, but because T is large the exponent is close to 0, so <math>\displaystyle r </math> is close to 1 and we accept most uphill moves (the chain explores freely)<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore (almost always) reject <math>\displaystyle y </math>, so the chain only moves downhill<br />
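<br />
To make the effect of T concrete, the short sketch below evaluates the acceptance probability of one fixed uphill move at several temperatures. The increase h(y) - h(x) = 1 and the list of temperatures are illustrative assumptions.<br />
<br />
```matlab
% Acceptance probability of an uphill proposal as the temperature falls
dh = 1;                            % h(y) - h(x) > 0, an uphill move (assumed value)
T  = [100 10 1 0.1 0.01];          % temperatures, from large to small (assumed)
r  = min(1, exp(-dh ./ T));        % r = min{1, exp((h(x) - h(y))/T)}
disp([T; r]);                      % first row: T, second row: r
```
<br />
At T = 100 the move is accepted with probability about 0.99, while at T = 0.1 the probability has dropped to about 4.5e-5.<br />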
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for large T and contracts for small T, i.e. T plays the role of a variance, making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
Towards the end, when T is small, the distribution that we are sampling from becomes sharper, and the points that we sample lie close to the local maximum of the exponential function (which is the mode of the distribution), which corresponds to the local minimum of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following (the objective function is saved in its own file, obj.m):<br />
 function v = obj(x)<br />
 v = (x - 2).^2;<br />
<br />
 T = 10; %this is the initial value of T, which we must gradually decrease<br />
 b = 2; %standard deviation of the proposal<br />
 X(1) = 0; %starting point<br />
 for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
     Y = b*randn + X(i-1); %proposal Y ~ N(X(i-1), b^2)<br />
     %f(Y)/f(X(i-1)) written as a single exponential, which avoids 0/0 when T is very small<br />
     r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) );<br />
     U = rand;<br />
     if U < r<br />
         X(i) = Y; %accept Y<br />
     else<br />
         X(i) = X(i-1); %reject Y<br />
     end;<br />
 end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
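<br />
The whole cooling schedule can also be run in one loop instead of re-running the code by hand for each T. The following is a sketch of that idea, assuming 100 iterations per temperature (the variable m) and using an anonymous function h in place of obj.m; both are conveniences introduced here.<br />
<br />
```matlab
% Simulated annealing for h(x) = (x - 2)^2 over the full temperature schedule
h  = @(x) (x - 2).^2;              % objective to minimize
Ts = [10 9 5 2 0.9 0.1 0.01 0.005 0.001];   % temperatures, in the order used in the text
b  = 2;                            % standard deviation of the proposal
m  = 100;                          % iterations per temperature (assumed)
X  = zeros(1, m*numel(Ts) + 1);    % X(1) = 0 is the starting point
i  = 1;
for T = Ts                         % T takes each temperature in turn
    for k = 1:m
        Y = X(i) + b*randn;                    % proposal Y ~ N(X(i), b^2)
        r = min(1, exp((h(X(i)) - h(Y))/T));   % f(Y)/f(X(i)) with f proportional to exp(-h/T)
        if rand < r
            X(i+1) = Y;            % accept Y
        else
            X(i+1) = X(i);         % reject Y
        end
        i = i + 1;
    end
end
% plot(X)   % uncomment to view the trajectory settling near x = 2
```
<br />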
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, as Metropolis-Hastings does:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the proof given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively large correlations between the random variables, because each conditional update can then only take a small step.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time to converge; this initial period is called the "burn-in time". For this reason the distribution is sampled better by using only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle x_2 \mid x_1 \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle x_1 \mid x_2 \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
 Y = [1 ; 2]; %Y holds the mean vector (mu_1, mu_2)<br />
 rho = 0.01;<br />
 sigma = sqrt(1 - rho^2); %conditional standard deviation<br />
 X(1,:) = [0 0];<br />
<br />
 for i = 2:5000<br />
     mu = Y(1) + rho*(X(i-1,2) - Y(2)); %conditional mean of x1 given the previous x2<br />
     X(i,1) = mu + sigma*randn;<br />
     mu = Y(2) + rho*(X(i,1) - Y(1)); %conditional mean of x2 given the NEW x1<br />
     X(i,2) = mu + sigma*randn;<br />
 end;<br />
 %plot(X(:,1),X(:,2),'.') plots all of the points<br />
 %plot(X(1000:end,1),X(1000:end,2),'.') plots roughly the last 4000 points -><br />
 %this demonstrates that convergence occurs after a while<br />
 %(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
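<br />
The procedure above can be sketched in MATLAB as follows. This is only an illustration under stated assumptions: the target f is an unnormalized bivariate normal with mean (1, 2), unit variances and correlation 0.5, and both proposals are normal random walks with step size b = 1, so they are symmetric and the q ratios equal 1. None of these choices come from the lecture.<br />
<br />
```matlab
% Metropolis-Hastings within Gibbs for a two-dimensional target
mu  = [1; 2];  rho = 0.5;           % target parameters (assumed)
% Unnormalized bivariate normal density, unit variances, correlation rho
f = @(x, y) exp(-((x-mu(1))^2 - 2*rho*(x-mu(1))*(y-mu(2)) + (y-mu(2))^2) / (2*(1-rho^2)));
b = 1;                              % random-walk step size for both proposals (assumed)
n = 20000;
X = zeros(n,1);  Y = zeros(n,1);    % the chain starts at (0, 0)
for i = 1:n-1
    % Metropolis-Hastings step for X, treating Y as fixed
    Z = X(i) + b*randn;
    r = min(1, f(Z, Y(i)) / f(X(i), Y(i)));
    if rand < r, X(i+1) = Z; else, X(i+1) = X(i); end
    % Metropolis-Hastings step for Y, treating the new X as fixed
    Z = Y(i) + b*randn;
    r = min(1, f(X(i+1), Z) / f(X(i+1), Y(i)));
    if rand < r, Y(i+1) = Z; else, Y(i+1) = Y(i); end
end
keep = 2001:n;                      % drop an initial stretch of the chain
c = corrcoef(X(keep), Y(keep));     % c(1,2) estimates the correlation
```
<br />
After discarding the initial stretch, the sample means of X and Y should be close to 1 and 2, and c(1,2) should be close to 0.5.<br />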
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many important pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero when <math>\displaystyle d < 1</math>, because every page receives the constant term <math>\displaystyle 1-d</math>. The coefficient <math>\displaystyle d </math> (between 0 and 1) balances this constant against the weighted sum of the ranks of the linking pages.<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, only its overall scale needs to be fixed. Summing the definition of <math>\displaystyle P_i</math> over all pages (with every <math>\displaystyle c_j > 0</math>) gives <math>\displaystyle e^T P = N(1-d) + d \, e^T P</math>, so the normalization consistent with the definition is <br />
<br />
<math>\sum_{i=1}^N P_i = N</math> (i.e. the average rank is 1; in matrix form this is <math>\displaystyle e^T \times P = N</math>)<br />
<br />
With this normalization we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P / N + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P / N </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T / N + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known, and each column of A sums to 1)} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
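<br />
As an illustration, the iteration can be tried on a tiny made-up web of four pages. The link matrix L, the value d = 0.85 and the number of iterations below are assumptions chosen for the example. The ranks are scaled to sum to N (average rank 1), which is the scale implied by the rank formula and makes every column of A sum to 1.<br />
<br />
```matlab
% PageRank by repeated multiplication on a hypothetical 4-page web
L = [0 0 1 1;
     1 0 0 0;
     1 1 0 1;
     0 0 0 0];                     % L(i,j) = 1 if page j links to page i
N = size(L,1);
d = 0.85;                          % balancing coefficient (assumed value)
c = sum(L,1);                      % c(j) = number of outgoing links from page j
e = ones(N,1);
A = (1-d)*(e*e')/N + d*L*diag(1./c);   % each column of A sums to 1
P = e;                             % initial guess with sum(P) = N
for k = 1:200
    P = A*P;                       % P_k = A * P_{k-1}
end
disp(P');                          % ranks of pages 1 to 4
```
<br />
Page 4 has no incoming links, so its rank is just the constant term 1 - d = 0.15, while page 3, which is linked from all the other pages, ends up with a much larger rank.<br />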
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
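<br />
These definitions map directly onto MATLAB operations. A small sketch with two arbitrarily chosen vectors:<br />
<br />
```matlab
% Inner product, length, distance and angle for two example vectors
u = [3; 4; 0];
v = [4; 3; 0];
ip    = u' * v;                    % inner product u^T v = 3*4 + 4*3 + 0 = 24
len_u = sqrt(u' * u);              % length of u = 5 (same as norm(u))
dist  = norm(u - v);               % distance between u and v = sqrt(2)
theta = acos(ip / (norm(u)*norm(v)));   % angle between u and v in radians = acos(24/25)
```
<br />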
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues <math>\displaystyle \lambda_1, \dots, \lambda_n</math> of the matrix<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or, equivalently, columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>,then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>,then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math> and satisfies <math>A^{\dagger}A = I</math>.<br />
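<br />
A quick numerical check of this formula, using an arbitrary 3-by-2 matrix whose columns are linearly independent (the matrix is an illustrative choice):<br />
<br />
```matlab
% Pseudo-inverse of a tall matrix with linearly independent columns
A = [1 0;
     1 1;
     1 2];                         % 3-by-2, not square, so inv(A) does not exist
Adag = (A'*A) \ A';                % (A^T A)^{-1} A^T, computed without forming the inverse
left  = Adag * A;                  % equals the 2-by-2 identity
right = A * Adag;                  % 3-by-3 projection matrix, NOT the identity
```
<br />
Here Adag agrees with the built-in pinv(A). Note that only the product with the pseudo-inverse on the left gives the identity.<br />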
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> (the corresponding ''eigenvalue'') such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular, all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and eigenvectors associated with distinct eigenvalues are orthogonal.<br />
*If A is symmetric and positive-definite, all eigenvalues are positive.<br />
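<br />
These properties can be checked numerically with MATLAB's eig. The 2-by-2 matrix below is an arbitrary real, symmetric, positive-definite example:<br />
<br />
```matlab
% Eigenvalues and eigenvectors of a small symmetric positive-definite matrix
A = [2 1;
     1 2];
[V, D] = eig(A);                   % columns of V are eigenvectors, diag(D) the eigenvalues
lambda = diag(D);                  % eigenvalues 1 and 3: real, non-zero and positive
resid  = norm(A*V - V*D);          % about 0, i.e. A*v = lambda*v for every column v
ortho  = norm(V'*V - eye(2));      % about 0, i.e. the eigenvectors are orthonormal
```
<br />
The sum of the eigenvalues equals trace(A) = 4, in line with the trace property stated earlier.<br />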
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> onto a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
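<br />
For instance, a rotation of the plane is an orthonormal transformation. The sketch below uses a rotation by 30 degrees and an arbitrary vector x to check that the length is preserved:<br />
<br />
```matlab
% A rotation matrix preserves the length of any vector it is applied to
t = pi/6;                          % rotation angle of 30 degrees (arbitrary)
A = [cos(t) -sin(t);
     sin(t)  cos(t)];
x = [3; 4];
y = A*x;                           % the transformed vector
check_orth = norm(A*A' - eye(2));  % about 0, so A*A^T = I
len_x = norm(x);                   % 5
len_y = norm(y);                   % also 5, up to rounding
```
<br />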
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
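<br />
The decorrelating effect can be seen on simulated data. In the sketch below, the covariance matrix S used to generate the data and the sample size are illustrative assumptions; chol is used only to produce correlated samples.<br />
<br />
```matlab
% Projecting data onto the eigenvectors of its covariance matrix removes the correlation
n = 5000;
S = [2   1.2;
     1.2 1  ];                     % covariance used to generate the data (assumed)
X = randn(n,2) * chol(S);          % rows are samples with covariance close to S
C = cov(X);                        % sample covariance matrix
[V, D] = eig(C);                   % eigenvectors (columns of V) and eigenvalues of C
Xc = bsxfun(@minus, X, mean(X));   % centered data
Z  = Xc * V;                       % coordinates in the eigenvector basis
Cz = cov(Z);                       % diagonal, with the eigenvalues on the diagonal
```
<br />
The off-diagonal entries of Cz are zero up to rounding, and its diagonal entries are the eigenvalues in D, i.e. the variances along the principal directions.<br />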
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
Given a high-dimensional sample of vectors, applying PCA produces an orthogonal set of vectors (called principal components) such that the first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
If we ignore the last few principal components (directions with the smallest variance) then we can approximate the data by a lower-dimensional subspace, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
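<br />
The matrix approximation described in this section can be sketched on synthetic data. The handwritten-digit images are not reproduced here, so the sketch assumes a made-up 10-dimensional data set with two dominant directions; it also computes the principal components with svd of the centered data, which is one common way to do it and not necessarily the method used in the course.<br />
<br />
```matlab
% Rank-2 PCA approximation X ~ xbar + V*W on synthetic data
D = 10;  n = 200;                   % dimension and number of samples (assumed stand-in for 644 and 103)
B = randn(D, 2);                    % two hidden directions that generate most of the variation
X = B*randn(2, n) + 0.05*randn(D, n);   % D-by-n data matrix, one sample per column
xbar = mean(X, 2);                  % mean sample (the constant term)
Xc = bsxfun(@minus, X, xbar);       % centered data
[U, S, ~] = svd(Xc, 'econ');        % left singular vectors are the principal directions
V = U(:, 1:2);                      % D-by-2: the first two principal components
W = V' * Xc;                        % 2-by-n: coefficients (lambda_1, lambda_2) of each sample
Xhat = bsxfun(@plus, V*W, xbar);    % approximation built from 2 coefficients per sample
relerr = norm(X - Xhat, 'fro') / norm(Xc, 'fro');   % relative approximation error
```
<br />
Each column of X is described by just the two numbers in the corresponding column of W, and relerr shows how little is lost by dropping the remaining directions.<br />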
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction], PCA is linear).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of PCA===<br />
We want to find the direction of maximum variation. So take a direction <math>w = [w_1, \ldots, w_D]^T</math> and a data point <math>x = [x_1, \ldots, x_D]^T </math>, then compute the length of the projection of the point onto that direction.<br />
<br />
<math><br />
u = \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
</math><br />
<br />
Of course, the direction <math>\textbf{w}</math> is the same as <math>2\textbf{w}</math> or in general <math>c\textbf{w}</math>, and it doesn't matter which one we use. So without loss of generality, let the length of <math>\textbf{w}</math> be 1. Therefore <math>\textbf{w}^T \textbf{w} = 1</math> so the equation simplifies to just<br />
<br />
<math><br />
u = \textbf{w}^T \textbf{x}.<br />
</math><br />
<br />
Let <math>x_1, \ldots, x_D</math> be random variables. Our goal is then to maximize the variance of <math>u</math>, which is<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br />
where <math>\Sigma</math> is the covariance matrix. For a finite data set we can replace <math>\Sigma</math> by <math>s</math>, the sample covariance matrix. <br />
<br />
So <math>\displaystyle w^T sw </math> is the variance of the projection <math>\displaystyle u </math> obtained with the weight vector <math>\displaystyle w </math>.<br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br />
Therefore the variance is bounded, so the maximum exists. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, but subject to a constraint. The problem then becomes:<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br />
<br />
Next lecture we will actually find the maximum.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that finding the direction of maximum variance reduces to the problem: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a Lagrange Multiplier and we form the Lagrangian L:<br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br />
If <math>\displaystyle (x^*,y^*)</math> is a maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, then there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of L (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a level curve of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel (equivalently, the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel):<br />
<br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where <math>\displaystyle \nabla_{x,y} f = \left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)</math><br />
<br><br />
and <math>\displaystyle \nabla_{x,y} g = \left(\frac{\partial g}{\partial x},\frac{\partial g}{\partial y}\right)</math><br />
<br><br />
To incorporate these into one equation, we define L as <math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math>.<br />
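<br />
As a small worked illustration (an example chosen here for concreteness, not one from the lecture), consider maximizing <math>f(x,y) = x + y</math> subject to <math>g(x,y) = x^2 + y^2 - 1 = 0</math>. The Lagrangian and its stationarity conditions are<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x + y - \lambda (x^2 + y^2 - 1)</math><br />
<br />
<math>\displaystyle \frac{\partial L}{\partial x} = 1 - 2\lambda x = 0, \quad \frac{\partial L}{\partial y} = 1 - 2\lambda y = 0, \quad \frac{\partial L}{\partial \lambda} = -(x^2 + y^2 - 1) = 0</math><br />
<br />
The first two conditions give <math>x = y = \frac{1}{2\lambda}</math>, and substituting into the constraint gives <math>\lambda = \pm \frac{1}{\sqrt{2}}</math>. Taking <math>\lambda = \frac{1}{\sqrt{2}}</math> yields the maximum <math>f = \sqrt{2}</math> at <math>x = y = \frac{1}{\sqrt{2}}</math>; the other sign gives the minimum.<br />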
<br />
<br><br />
Back to the original problem, from the Lagrangian we obtain <math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T s \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T s \textbf{w} </math> can be thought of as a quadratic function in '''w''', hence the '''2sw''' below)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2s\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br />
<math>\displaystyle s\textbf{w} = \lambda\textbf{w} </math><br />
<br><br />
This equation means that <math>\textbf{w}</math> is an eigenvector of s and <math>\lambda</math> is an eigenvalue of s.<br />
<br><br />
If we substitute <math>\displaystyle\textbf{w}</math> in <math>\displaystyle \textbf{w}^T s\textbf{w}</math> we obtain <math>\displaystyle\textbf{w}^T s\textbf{w} = \textbf{w}^T \lambda \textbf{w} = \lambda w^T w = \lambda </math><br />
<br><br />
In order to maximize the objective function we need to choose the eigenvector with the largest eigenvalue.<br />
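<br />
The result can be checked numerically. The following is a minimal sketch on synthetic data (the mixing matrix and sample size are arbitrary assumptions, not course data): it takes the eigenvector of the sample covariance with the largest eigenvalue and compares the variance of the projections with that eigenvalue.<br />
<br />
```matlab
n = 500;
X = [2 0; 1 0.5] * randn(2, n);    % hypothetical correlated data, one point per column
s = cov(X');                       % sample covariance (cov expects one observation per row)
[W, Lambda] = eig(s);              % columns of W are eigenvectors of s
[lambda1, k] = max(diag(Lambda));  % largest eigenvalue and its position
w = W(:, k);                       % first principal direction, with w'*w = 1
u = w' * X;                        % projections of the data onto w
disp([var(u), lambda1])            % the two numbers should agree up to rounding
```
<br />
Centring the data is not needed for this particular check, because the variance of the projections does not change when the data are shifted.<br />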
<br />
We choose the first PC, '''u1''', to have the maximum variance (i.e. to capture as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible).<br />
<br />
Subsequent principal components account for successively smaller parts of the total variability.<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
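<br />
As a quick numerical check of this identity, the sketch below (synthetic data with an arbitrary mixing matrix, assumed here for illustration) computes the three quantities and displays them side by side.<br />
<br />
```matlab
X = [1 0 0; 0.5 2 0; 0.3 0.2 0.7] * randn(3, 1000);  % hypothetical data, D = 3, one point per column
s = cov(X');                       % sample covariance matrix
lambda = eig(s);                   % eigenvalues = variances of the principal components
totals = [sum(lambda), trace(s), sum(var(X, 0, 2))];
disp(totals)                       % three equal numbers, up to rounding
```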
<br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, assuming that the intrinsic dimensionality of the data is 10. We compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')  % loads the data matrix X, one image per column<br />
 k = 1;                                     % index of the image to display<br />
 imagesc(reshape(X(:,k),20,28)')            % noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';  % rank-10 reconstruction<br />
 figure<br />
 imagesc(reshape(xHat(:,k),20,28)')         % de-noised version of the same image<br />
 colormap gray<br />
<br />
Running the above code gives us two images. The first shows the noisy data, in which we can barely make out the face.<br />
<br />
The second is the de-noised image.<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
Main Contribution not complete<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (images, etc.). There are generally two ways to do this. We can classify the data (i.e. give each data point a label and compare the different types of data) or cluster the data (i.e. leave the data unlabeled and look for groups in the output).<br />
<br />
Generally speaking, we could do this with the entire data set (for an 8 × 8 picture, we could use all 64 pixels). However, this is hard; it is easier to work with the reduced data, i.e. with features of the data.<br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
MATLAB CODE<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
1. The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
2. Encoding the data means projecting the data onto a lower-dimensional subspace by taking the inner product. <math>U^T X </math> is a (d x n) matrix.<br />
3. When we reconstruct the training set, we use only the top d dimensions. This eliminates the dimensions that have lower variance (e.g. noise).<br />
4. We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
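<br />
The notes above can be sketched in code as follows. This is an illustrative sketch on synthetic data, not the algorithm from the slide: the sizes <code>D</code>, <code>n</code>, <code>d</code> and the noise level are assumptions, and the principal directions are taken from the SVD of the centred data.<br />
<br />
```matlab
D = 5; n = 200; d = 2;                               % hypothetical sizes
X = randn(D, d) * randn(d, n) + 0.05 * randn(D, n);  % data lying close to a d-dimensional subspace
mu = mean(X, 2);
Xc = X - mu * ones(1, n);          % note 1: subtract the mean
[U, S, V] = svd(Xc, 'econ');       % left singular vectors are the principal directions
Ud = U(:, 1:d);                    % keep the top d directions
Y = Ud' * Xc;                      % note 2: encode, giving a d x n matrix
Xhat = Ud * Y + mu * ones(1, n);   % note 3: reconstruct from d dimensions
relErr = norm(X - Xhat, 'fro') / norm(X, 'fro');
disp(relErr)                       % small, since only low-variance directions are dropped
```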
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
Main Contribution not complete<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (in this case, a direction representative of a particular characteristic, e.g. glasses vs. no glasses).<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes. For a 2 class problem, we want to reduce the data to one dimension (a line). Generally, for a k class problem, we want k-1 dimensions.<br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, ideally the individual classes are as far away from each other as possible, while the points within each class are close together (i.e. each class collapses to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
<b> Goal </b><br />
<br />
<b>1. Minimize the within class variance</b><br />
<br />
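<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
where <math>\Sigma_1</math> and <math>\Sigma_2</math> are the covariance matrices of the two classes, and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />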
<br />
<br />
<b>2. Maximize the distance between the means of the projected data</b><br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math></div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Face2.jpg&diff=3226File:Face2.jpg2009-07-18T20:54:35Z<p>Hclam: de-noised face</p>
<hr />
<div>de-noised face</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3225stat341 / CM 3612009-07-18T20:53:45Z<p>Hclam: /* Lagrange Multiplier */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges, because a computer is deterministic. In practice we use deterministic algorithms whose output appears to be uniformly distributed; such numbers are called pseudo-random. Outside a computational setting, sampling from a uniform distribution is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
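<br />
The recurrence is short enough to try directly. The sketch below uses the parameters of the example above; the number of terms <code>N</code> is an arbitrary choice.<br />
<br />
```matlab
a = 13; b = 0; m = 31;     % parameters from the example above
N = 10;                    % number of terms to generate (arbitrary)
x = zeros(1, N);
x(1) = 1;                  % seed x_0 (MATLAB indices start at 1)
for i = 1:N-1
    x(i+1) = mod(a * x(i) + b, m);
end
disp(x(1:4))               % first four terms: 1, 13, 14, 27
u = x / (m - 1);           % scale to [0, 1] as described above
```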
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and those that do not are unsuitable for generating pseudo random numbers.<br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for any ''n'', and a period equal to ''m'' - 1. In practice, a paper published in 1988 by Park and Miller found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the largest value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is passed through the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) of some distribution, then the distribution of the random variable X is given by F(x).<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is at most some number cannot decrease as that number grows.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
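:<math> F^{-1}(y) = \frac{-\log(1-y)}{\theta} </math><br />
<br />
Since <math>1 - U</math> is also distributed as Unif[0, 1] whenever <math>U</math> is, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>.<br />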
<br />
Now we can generate our random sample <math>i=1\dots n</math> from <math>f(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows an exponential pattern, as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is too difficult (or impossible) to invert <math>F(x)</math>. Further, for some distributions <math>F(x)</math> itself has no closed-form expression; and even when a closed form exists, it is usually difficult to find <math>F^{-1}</math>.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (the intervals are half-open, so each value of <math>U</math> in [0, 1) falls in exactly one of them; the endpoint <math>U = 1</math> occurs with probability zero):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U < p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 \leq U < p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} \leq U < p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000);           % preallocate the sample<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2))  % cumulative probability p(1)+p(2) = 0.5<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />
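<br />
The hard-coded thresholds generalise to any finite pmf by comparing <math>U</math> against the running sum of the probabilities. The following is a sketch of that idea using the same distribution; the variable names <code>vals</code>, <code>cdf</code> and <code>freq</code> are chosen here for illustration.<br />
<br />
```matlab
vals = [0 1 2];                 % support x_0, ..., x_n (values from the example)
p    = [0.3 0.2 0.5];           % P(X = x_k)
cdf  = cumsum(p);               % running sums p_0, p_0 + p_1, ...
N = 10000;
u = rand(1, N);
x = zeros(1, N);
for i = 1:N
    k = find(u(i) < cdf, 1);    % first k with u < p_0 + ... + p_k
    if isempty(k)
        k = numel(p);           % guard in case rounding leaves cdf(end) slightly below 1
    end
    x(i) = vals(k);
end
freq = [mean(x == 0), mean(x == 1), mean(x == 2)];
disp(freq)                      % empirical frequencies, to compare with p
```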
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c \ |\ c \cdot g(x) \geq f(x)\ \forall x</math>, we draw samples successively from <math>g(x)</math> and accept each draw with probability <math> \frac {f(x)}{c \cdot g(x)} </math>. Draws where this ratio is close to 1 are almost always accepted, while draws where it is small are mostly rejected. The accepted draws form a sample that follows the target distribution <math>f(x)</math>.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (scaled proposal distribution).<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, sampling from <math>g(x)</math> will yield a sample that follows the target distribution <math>f(x)</math>.<br />
<br />
At x=9, we will reject samples according to the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> after sampling from <math>g(x)</math>.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> f(x) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
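 n = 1000;         % number of samples to accept<br />
 x = zeros(1, n);<br />
 j = 0;            % number of samples accepted so far<br />
 while j < n<br />
   y = rand;       % proposal Y ~ Unif[0,1]<br />
   u = rand;<br />
   if u <= y       % accept with probability f(y)/(c*g(y)) = y<br />
     j = j + 1;<br />
     x(j) = y;<br />
   end<br />
 end<br />
 hist(x);<br />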
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
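<br />
The proof also predicts that a proposal is accepted with probability <math>1/c</math>, which is 0.5 in this example. The vectorised sketch below checks this empirically; the number of proposals <code>N</code> is an arbitrary assumption.<br />
<br />
```matlab
c = 2;                          % envelope constant from the example
N = 20000;                      % number of proposals (arbitrary)
y = rand(1, N);                 % proposals Y ~ Unif[0,1], so g(y) = 1
u = rand(1, N);
accepted = u <= (2 * y) / c;    % accept when U <= f(Y) / (c g(Y))
x = y(accepted);                % accepted draws follow f(x) = 2x
rate = mean(accepted);
disp([rate, 1 / c])             % empirical and theoretical acceptance rate
disp([mean(x), 2 / 3])          % sample mean and the Beta(2,1) mean
```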
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
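<br />
The steps of this discrete example can be sketched as follows. The sample size <code>N</code> and the variable names are assumptions made for illustration.<br />
<br />
```matlab
f = [0.15 0.55 0.20 0.10];      % target pmf on {1,2,3,4}
g = 0.25;                       % pmf of the discrete uniform proposal
c = max(f / g);                 % envelope constant, 0.55 / 0.25 = 2.2
N = 5000;                       % number of samples to accept (arbitrary)
x = zeros(1, N);
for j = 1:N
    accepted = false;
    while ~accepted
        y = randi(4);           % step 1: Y ~ Unif{1,2,3,4}
        u = rand;               % step 2: U ~ Unif[0,1]
        accepted = (u <= f(y) / (c * g));   % step 3: accept or try again
    end
    x(j) = y;
end
freq = [mean(x == 1), mean(x == 2), mean(x == 3), mean(x == 4)];
disp(freq)                      % empirical frequencies, to compare with f
```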
<br />
'''Limitations'''<br />
<br />
If the proposed distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find such <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>, with mean <math>1/\lambda</math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math>.<br />
<br />
It appears that if we want to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
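<br />
For a rate other than 1, the product form from step 3 of the procedure can be used directly. The sketch below assumes <math>t = 5</math> and <math>\lambda = 2</math> (arbitrary values chosen for illustration) and compares the sample mean and variance with the Gamma values <math>t/\lambda</math> and <math>t/\lambda^2</math>.<br />
<br />
```matlab
t = 5; lambda = 2;                  % hypothetical shape and rate
N = 10000;                          % sample size (arbitrary)
U = rand(N, t);                     % t independent uniforms per sample
X = -log(prod(U, 2)) / lambda;      % X = -(1/lambda) * log(u_1 * ... * u_t)
disp([mean(X), t / lambda])         % sample mean and theoretical mean
disp([var(X),  t / lambda^2])       % sample variance and theoretical variance
```
<br />
For large <math>t</math> the product of uniforms can underflow, in which case summing the logarithms, as in the code above with <code>sum(-log(U), 2)</code>, is the safer form.<br />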
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far: for example, the inverse of F(Z) is not easy to obtain. We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) are the Cartesian coordinates of a point, r is its distance from the origin (the pole), and θ is the angle it forms with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}} \cdot \frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
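The three steps of the procedure can also be sketched in Python, using only the standard library. This is a minimal illustration rather than the course's Matlab code; the function name <code>box_muller</code>, the seed and the sample size are assumptions made for the example.<br />

```python
import math
import random

def box_muller(rng):
    """Return one pair (x, y) of independent N(0,1) draws."""
    u1 = 1.0 - rng.random()        # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    d = -2.0 * math.log(u1)        # d = r^2 is exponential with rate 1/2
    theta = 2.0 * math.pi * u2     # theta is uniform on [0, 2*pi)
    r = math.sqrt(d)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)             # fixed seed (assumption) for repeatability
pairs = [box_muller(rng) for _ in range(20000)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

print("means:", mean_x, mean_y)
print("variances:", var_x, var_y)
print("covariance:", cov_xy)
```

The sample means and covariance should be near 0 and the sample variances near 1, which is consistent with x and y being independent standard normals.<br />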
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independently} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: Sample from a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (called u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
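Note: In the example above, we had to generate a square root of <math>\Sigma</math> using the Cholesky decomposition. Matlab's chol returns an upper triangular matrix ss with ss'*ss = sigma, which is why each row of r is multiplied by ss on the right, <br />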
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
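The same construction can be sketched in Python without any matrix library. This is an illustration under assumptions: the helper <code>cholesky_2x2</code> is hypothetical and only handles the 2 x 2 case, and it returns the lower triangular factor L (the transpose of what Matlab's chol returns), so that <math>\vec{X} = \vec{\mu} + L\vec{Z}</math>.<br />

```python
import math
import random

def cholesky_2x2(s11, s12, s22):
    """Lower triangular (l11, l21, l22) with L * L^T = [[s11, s12], [s12, s22]]."""
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    return l11, l21, l22

mu = (-2.0, 3.0)
l11, l21, l22 = cholesky_2x2(1.0, 0.5, 1.0)

rng = random.Random(1)             # fixed seed (assumption) for repeatability
samples = []
for _ in range(15000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # X = mu + L Z, written out component by component
    samples.append((mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2))

n = len(samples)
m1 = sum(s[0] for s in samples) / n
m2 = sum(s[1] for s in samples) / n
v1 = sum((s[0] - m1) ** 2 for s in samples) / (n - 1)
v2 = sum((s[1] - m2) ** 2 for s in samples) / (n - 1)
c12 = sum((s[0] - m1) * (s[1] - m2) for s in samples) / (n - 1)

print("sample mean:", m1, m2)
print("sample covariance:", v1, c12, v2)
```

The sample mean should be close to (-2, 3) and the sample covariance close to <math>\Sigma</math>.<br />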
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right]</math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming <math>\sigma</math> is known, so that <math>\theta = \mu</math>, and that the prior on <math>\mu</math> is Gaussian:<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \left[\prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}\right] * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math> and <math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean <math>\tilde{\mu}</math> generally differs from the MLE <math>\bar{x}</math>: it is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
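The two limiting cases can be checked numerically. The Python sketch below is only an illustration: the true mean, <math>\sigma</math>, <math>\mu_0</math> and <math>\tau</math> are hypothetical values chosen for the example, and <code>posterior_mean</code> is a helper name introduced here that implements the formula for <math>\tilde{\mu}</math>.<br />

```python
import random

def posterior_mean(xs, sigma, mu0, tau):
    """Posterior mean of mu for N(mu, sigma^2) data with a N(mu0, tau^2) prior."""
    n = len(xs)
    xbar = sum(xs) / n if n > 0 else 0.0   # xbar has zero weight when n == 0
    w_data = n / sigma ** 2                # precision contributed by the data
    w_prior = 1.0 / tau ** 2               # precision of the prior
    return (w_data * xbar + w_prior * mu0) / (w_data + w_prior)

rng = random.Random(0)
true_mu, sigma = 5.0, 2.0    # hypothetical data-generating values
mu0, tau = 0.0, 1.0          # hypothetical prior
data = [rng.gauss(true_mu, sigma) for _ in range(10000)]

for n in (0, 5, 100, 10000):
    xs = data[:n]
    mle = sum(xs) / n if n > 0 else float("nan")
    print(n, "MLE:", mle, "posterior mean:", posterior_mean(xs, sigma, mu0, tau))
```

With no data the posterior mean equals <math>\mu_0</math>, and as N grows it approaches the sample mean, matching the two notes above.<br />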
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is hard to find a precise definition of "Monte Carlo Method", the method is widely used and is described in numerous articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to simulating a random process, that is, a process whose future evolution is described by probability distributions (the counterpart of a deterministic process); these simulation methods are known as ''Monte Carlo methods'' [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
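The process and the two statistics can be put together in a short Python sketch. It is an illustration under assumptions: the function name <code>mc_integrate</code>, the seed, the sample size and the choice <math>h(x) = x^3</math> on [0,1] (exact value 1/4) are made up for the example, and 1.96 is used as <math>Z_{\alpha/2}</math> for a 95% interval.<br />

```python
import math
import random

def mc_integrate(h, a, b, n, rng):
    """Estimate the integral of h over [a, b] and its standard error."""
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]   # Y_i = w(x_i)
    i_hat = sum(ys) / n
    s2 = sum((y - i_hat) ** 2 for y in ys) / (n - 1)          # sample variance
    se = math.sqrt(s2 / n)
    return i_hat, se

rng = random.Random(2)
i_hat, se = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 10000, rng)
z = 1.96                                   # normal quantile for a 95% interval
ci = (i_hat - z * se, i_hat + z * se)

print("estimate:", i_hat)
print("standard error:", se)
print("95% CI:", ci)
```

The estimate should be close to 0.25, and the interval width shrinks at the rate <math>1/\sqrt{N}</math>.<br />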
<br />
'''Example 1'''<br />
<br />
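Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x \geq 0</math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, which can be done with the inverse transform <math>x = -\log(u)</math>, <math>u \sim~ Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 u=rand(100,1)<br />
 x=-log(u)        % inverse transform: x follows f(x) = exp(-x)<br />
 h= x.^0.5        % h(x) = sqrt(x), computed elementwise<br />
 mean(h)<br />
 %The value obtained using the Monte Carlo method<br />
 F = @(x) sqrt(x) .* exp(-x)<br />
 quad(F,0,50)<br />
 %The value of the integral obtained by numerical quadrature, for comparison (the exact value is sqrt(pi)/2, about 0.8862)<br />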
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# The exact value is <math> \displaystyle I = \left. x^4/4 \right|_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
 x = rand(1000,1);   % 1000 draws from Unif(0,1), as a column vector<br />
 mean(x.^3)          % elementwise cube, not a matrix power<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite interval<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dx=-\frac{dy}{y^2} </math>, which maps <math>x=5</math> to <math>y=1</math> and <math>x \to \infty</math> to <math>y \to 0</math><br />
# <math> I = \displaystyle\int^0_1 3e^{-(\frac{1}{y}+4)}\left(-\frac{1}{y^2}\right)\,dy = \int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2}</math><br />
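Example 3 can be checked against the exact value <math>\int_5^\infty 3e^{-x}\,dx = 3e^{-5} \approx 0.0202</math>. The Python sketch below is an illustration only; it takes the integrand on (0,1] to be <math>3e^{-(1/y+4)}/y^2</math>, which is what the substitution <math>y = 1/(x-4)</math> gives once the limits are put in increasing order, and the seed and sample size are assumptions.<br />

```python
import math
import random

def w(y):
    """Integrand on (0, 1] after the substitution y = 1 / (x - 4)."""
    return 3.0 * math.exp(-(1.0 / y + 4.0)) / (y * y)

rng = random.Random(3)
n = 100000
total = 0.0
for _ in range(n):
    y = 1.0 - rng.random()     # uniform on (0, 1], so 1 / y is always defined
    total += w(y)

i_hat = total / n
exact = 3.0 * math.exp(-5.0)

print("Monte Carlo estimate:", i_hat)
print("exact value:", exact)
```

Because <math>w(y)</math> is bounded on (0,1], the estimator has a small variance and the two printed values should agree to about three decimal places.<br />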
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
By Bayes' rule, <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is a flat distribution, and the posterior expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find the posterior distributions of <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior density of the probability p given x successes in n trials is proportional to the Binomial likelihood <math>p^x(1-p)^{n-x}</math> viewed as a function of p. Normalizing this function of p gives exactly a Beta distribution. Let<br><br />
:<math>\displaystyle f(p_1 | X) \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle f(p_2 | Y) \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
 mean(delta)<br />
<br />
In one run of this code, the mean was 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The 95% interval is approximately <math> (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
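For readers without Matlab, the same simulation can be sketched in Python with the standard library's <code>random.betavariate</code>. The seed, the number of draws and the simple order-statistic quantiles are assumptions made for this illustration, so the numbers will differ slightly from the Matlab run.<br />

```python
import random

rng = random.Random(4)
n, m = 100, 100          # number of trials for X and Y
x, y = 80, 60            # observed successes
N = 10000                # number of posterior draws (assumption)

# Draw p1 ~ Beta(x+1, n-x+1) and p2 ~ Beta(y+1, m-y+1), then delta = p1 - p2
deltas = sorted(
    rng.betavariate(x + 1, n - x + 1) - rng.betavariate(y + 1, m - y + 1)
    for _ in range(N)
)

delta_hat = sum(deltas) / N
lower = deltas[int(0.025 * N)]     # approximate 2.5% quantile
upper = deltas[int(0.975 * N)]     # approximate 97.5% quantile

print("posterior mean of delta:", delta_hat)
print("95% interval:", (lower, upper))
```

The posterior mean should be close to the analytical value of about 0.1961.<br />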
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
Assuming a flat prior on each <math>\displaystyle p_i</math>:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math><br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta = x_j</math>, where j is the smallest index with <math>\displaystyle p_j \geq 0.5</math>.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, suppose that at each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
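The Monte Carlo process for the dose response problem can be sketched in Python as follows. All data in this sketch are hypothetical: the lecture does not list the full set of counts, so the dose levels 1 to 10, the 30 rats per dose and the death counts are invented purely to make the example run. The function name <code>draw_delta</code> is likewise introduced only for this illustration.<br />

```python
import random

doses = list(range(1, 11))                       # hypothetical levels x_1 < ... < x_10
n_rats = 30                                      # hypothetical rats per dose
deaths = [1, 3, 6, 10, 14, 18, 22, 25, 27, 29]   # hypothetical death counts y_i

def draw_delta(rng):
    """Return one accepted draw of delta, the first dose with p_i >= 0.5."""
    while True:
        p = [rng.betavariate(y + 1, n_rats - y + 1) for y in deaths]
        # Rejection step: keep the draw only if p_1 <= p_2 <= ... <= p_10
        if any(p[i] > p[i + 1] for i in range(len(p) - 1)):
            continue
        for dose, p_i in zip(doses, p):
            if p_i >= 0.5:
                return dose
        # No dose reached 50% in this draw, so delta is undefined: redraw

rng = random.Random(5)
N = 500
deltas = [draw_delta(rng) for _ in range(N)]
delta_bar = sum(deltas) / N

print("estimated delta:", delta_bar)
```

The rejection step can be slow when the posteriors of neighbouring doses overlap heavily, because most draws then violate the ordering constraint.<br />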
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution using samples drawn from a different distribution. As in the case of the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The key idea in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others, so the samples are weighted accordingly. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation of the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target distribution. Here we cast the integral <math>\int_{0}^1 g(x)dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, with <math>g(x) > 0</math> wherever <math>h(x)f(x) \neq 0</math>, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> at which the original density <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), is large relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, but weighting each sampled value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that the points that are more likely under f than under g count more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of <math> h(x_i) </math> instead of a sum<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. At points where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> in a region where <math>\displaystyle h^2(x)f^2(x)</math> does not go to zero at least as fast, then <math>\displaystyle E[(w(x))^2]</math> can be infinite. This occurs, for example, if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, so that <math>\frac{h^2(x)f^2(x)}{g(x)} </math> becomes arbitrarily large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
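<br />
As a worked illustration of Remark 1, consider a case that is not from the lecture and is assumed here only for concreteness: <math>h(x) \equiv 1</math>, <math>f</math> the <math>N(0,1)</math> density and <math>g</math> the <math>N(0,s^2)</math> density. The second moment of the weight can then be computed in closed form:<br />
:<math>E_g[(w(x))^2] = \int \frac{f^2(x)}{g(x)}\,dx = \frac{s}{\sqrt{2\pi}}\int e^{-x^2\left(1-\frac{1}{2s^2}\right)}\,dx = \begin{cases} \frac{s^2}{\sqrt{2s^2-1}}, & s^2 > \frac{1}{2} \\ \infty, & s^2 \leq \frac{1}{2} \end{cases}</math><br />
So a proposal whose tail is too thin (<math>s^2 \leq 1/2</math>) gives an infinite second moment, while <math>s = 1</math> (that is, <math>g = f</math>) gives <math>E_g[(w(x))^2] = 1 = I^2</math> and therefore zero variance.<br />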
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and cannot be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
:'''Aside: Jensen's Inequality'''<br />
::<br />
If <math>\displaystyle \varphi </math> is a convex function (for example, twice differentiable with <math>\displaystyle \varphi''(x) \geq 0 </math>) then <math>\displaystyle \varphi(\alpha x_1 + (1-\alpha)x_2) \leq \alpha \varphi(x_1) + (1-\alpha) \varphi(x_2)</math> for <math>\displaystyle 0 \leq \alpha \leq 1</math>. We write <math>\displaystyle \varphi</math> for the convex function to avoid confusion with the sampling density <math>\displaystyle g</math>.<br /><br />
Essentially the definition of convexity says that the line segment between two points on a curve lies on or above the curve. This can then be generalized to any finite number of points:<br />
::<math>\displaystyle \varphi(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 \varphi(x_1) + \alpha_2 \varphi(x_2) + ... + \alpha_n \varphi(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle \varphi(E[x]) \leq E[\varphi(x)] </math> as <math>\displaystyle \varphi(E[x]) = \varphi(p_1 x_1 + ... + p_n x_n) \leq p_1 \varphi(x_1) + ... + p_n \varphi(x_n) = E[\varphi(x)] </math><br />
Therefore, taking <math>\displaystyle \varphi(t) = t^2</math> and applying it to <math>\displaystyle \left| w(x) \right|</math>,<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
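Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math>, then <math>\displaystyle w(x) = \frac{h(x)f(x)}{g^*(x)}</math> has absolute value <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> wherever <math>\displaystyle h(x) \neq 0</math>, so <math>\displaystyle E[(w(x))^2] = \left(\int \left| h(s) \right| f(s) ds \right)^2</math>, which attains the lower bound. Hence <math>\displaystyle g^*</math> minimizes the variance.<br /><br />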
<br />
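However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is an integral of the same kind as the one we are trying to estimate.<br />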
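A small case where <math>g^*</math> can be written down shows what the theorem promises. The Python sketch below is an illustration under assumptions that are not from the lecture: <math>f</math> is the Unif(0,1) density and <math>h(x) = x</math>, so <math>I = 1/2</math> and <math>g^*(x) = 2x</math> on (0,1), which can be sampled by the inverse transform <math>x = \sqrt{u}</math>.<br />

```python
import math
import random

rng = random.Random(6)
N = 1000

# Importance sampling with the optimal proposal g*(x) = 2x on (0, 1]
weights = []
for _ in range(N):
    x = math.sqrt(1.0 - rng.random())     # inverse CDF of g*, x in (0, 1]
    h, f, g_star = x, 1.0, 2.0 * x
    weights.append(h * f / g_star)        # w(x) = h(x) f(x) / g*(x)

i_hat_is = sum(weights) / N
spread_is = max(weights) - min(weights)

# Plain Monte Carlo for comparison: x ~ f = Unif(0, 1), average h(x)
plain = [rng.random() for _ in range(N)]
i_hat_mc = sum(plain) / N
spread_mc = max(plain) - min(plain)

print("importance sampling:", i_hat_is, "spread of terms:", spread_is)
print("plain Monte Carlo:  ", i_hat_mc, "spread of terms:", spread_mc)
```

Every weight equals 1/2, so the importance sampling estimate has zero variance, whereas the plain Monte Carlo terms range over nearly all of (0,1). Writing down <math>g^*</math> here required knowing <math>\int |h| f = 1/2</math> in advance, which is exactly why the result is mostly of theoretical interest.<br />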
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \ h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since <math>\displaystyle \int f(x) dx = 1</math>, we can also write<br />
:: <math>I =\displaystyle \frac{\int\ h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int\ h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math> (the factors <math>\frac{1}{N}</math> cancel)<br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
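The normalized-weight form can be sketched in Python for a target known only up to a constant. The setup is an assumption made for illustration: the unnormalized target is <math>e^{-x^2/2}</math> (a standard normal without its constant), the proposal g is N(0, 4), and <math>h(x) = x^2</math>, so the exact answer is 1. The helper names <code>f_unnorm</code> and <code>g_pdf</code> are introduced only for this example.<br />

```python
import math
import random

def f_unnorm(x):
    """Target density known only up to a constant: exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)

def g_pdf(x, s=2.0):
    """Proposal density: normal with mean 0 and standard deviation s."""
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

rng = random.Random(7)
N = 50000
xs = [rng.gauss(0.0, 2.0) for _ in range(N)]        # x_i drawn from g
b = [f_unnorm(x) / g_pdf(x) for x in xs]            # unnormalized weights b(x_i)
total_b = sum(b)

# I_hat = sum_i b'(x_i) h(x_i), with b' = b / sum(b) and h(x) = x^2
i_hat = sum(b_i * x * x for b_i, x in zip(b, xs)) / total_b

print("self-normalized estimate of E[x^2]:", i_hat)
```

The missing constant <math>1/\sqrt{2\pi}</math> cancels between the numerator and the denominator, so the estimate is close to 1 even though f was never normalized.<br />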
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_i \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to compute the fraction of observations that are greater than 3.<br />
<br />
The variability will be high in this case since this is a low probability event (a tail event), so different runs may give significantly different values. For example, if we generate 1000 samples, the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (each of the 100 runs generates 100 samples, and we report the mean and variance of the 100 estimates):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x)</math> be the <math>N(4,1) </math> density<br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math> which is very close to 0, and is much less than the variability observed when using Monte Carlo<br />
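<br />
The comparison of the two approaches can also be sketched in Python. This is an illustration with an assumed seed and a larger sample size than the Matlab runs; the exact tail probability is computed from the standard library's <code>math.erfc</code>.<br />

```python
import math
import random

rng = random.Random(8)
N = 100000
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))     # Pr(Z > 3) for Z ~ N(0,1)

# Approach I: plain Monte Carlo, fraction of N(0,1) draws that are >= 3
hits = sum(1 for _ in range(N) if rng.gauss(0.0, 1.0) >= 3.0)
mc_hat = hits / N

# Approach II: importance sampling with proposal g = N(4,1),
# using the weight b(x) = f(x) / g(x) = exp(8 - 4x)
total = 0.0
for _ in range(N):
    x = rng.gauss(4.0, 1.0)
    if x >= 3.0:                                  # h(x) = 1 only when x >= 3
        total += math.exp(8.0 - 4.0 * x)
is_hat = total / N

print("exact:              ", exact)
print("plain Monte Carlo:  ", mc_hat)
print("importance sampling:", is_hat)
```

With the same number of draws, the importance sampling estimate is typically much closer to the exact value, since most draws from N(4,1) land in the region x ≥ 3 that matters.<br />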
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d. random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_n-1,...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which <math>\displaystyle X_t</math> depends only on <math> X_t-1</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from state <math>\displaystyle i</math> to state <math>\displaystyle j</math> in one step.<br />
:<math>p_{ij} = \Pr(X_{n+1}=j\mid X_{n}=i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For <math>N</math> states, the transition matrix <math>P</math> is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Suppose that the direction of each step is decided by a coin flip. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. With the states ordered from left to right, the transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\ddots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
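<br />
As an illustration, the walk can be simulated directly from its transition matrix. The following is a minimal Matlab sketch rather than code from the lecture: the number of states (6), the value of p (0.5) and the starting state are arbitrary choices, and the states are assumed to be numbered from left to right, so that a step to the right increases the state by one.<br />
<br />
 % Sketch: simulate the absorbing random walk on states 1..N (N, p and the start are illustrative)<br />
 N = 6; p = 0.5; q = 1 - p;<br />
 P = zeros(N);<br />
 P(1,1) = 1; P(N,N) = 1;                   % the two end states are absorbing<br />
 for i = 2:N-1<br />
     P(i,i+1) = p;                         % step to the right<br />
     P(i,i-1) = q;                         % step to the left<br />
 end<br />
 x = 3;                                    % arbitrary starting state<br />
 path = x;<br />
 while x ~= 1 && x ~= N<br />
     x = find(rand < cumsum(P(x,:)), 1);   % draw the next state from row x of P<br />
     path(end+1) = x;<br />
 end<br />
 disp(path)                                % the visited states, ending at 1 or N<br />
<br />
Each row of P is used as the pmf of the next state, which is why every row has to sum to 1.<br />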
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general concept for the application of this lies within having a set of generated <math>x_i</math> which approach a distribution <math>f(x)</math> so that a variation of importance estimation can be used to estimate an integral in the form<br /><br />
<math> I = \displaystyle\int^\ h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math>, summed over all <math>j</math>, add up to 1: row <math>i</math> must itself be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>:<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that<br />
<math>\frac{}{}\mu_{n-1}P=\mu_n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=x_i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=x_i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=x_i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
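<br />
The theorem is easy to check numerically. The sketch below is an illustration only: the two-state matrix P, the starting distribution mu0 and the number of steps n are made-up values, not taken from the lecture.<br />
<br />
 % Sketch: check mu_n = mu_0 * P^n on a small chain (P, mu0 and n are illustrative)<br />
 P   = [0.9 0.1; 0.4 0.6];<br />
 mu0 = [1 0];                 % start in state 1 with probability 1<br />
 n   = 5;<br />
 mu  = mu0;<br />
 for k = 1:n<br />
     mu = mu * P;             % one-step update mu_k = mu_{k-1} * P<br />
 end<br />
 disp(mu)                     % marginal distribution after n steps<br />
 disp(mu0 * P^n)              % the same vector obtained from the n-step matrix<br />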
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an NxN matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' is also NxN, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)''' <math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate <math>IFF</math> they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only move to each other or stay where they are (<math>\displaystyle P_{11}, P_{12}, P_{21}, P_{22}</math> are the only nonzero entries in their rows), so the chain never leaves this class<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state, but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state, since it is a closed class with only one element<br />
::<br />
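<br />
The same decomposition can be recovered mechanically from the transition matrix of the example. The sketch below is an illustration, not code from the lecture: it marks which states are reachable from which, and groups together states that can reach each other. The variable names are hypothetical.<br />
<br />
 % Sketch: communicating classes of the example chain, found from reachability<br />
 P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];<br />
 N = size(P,1);<br />
 A = double((P > 0) | logical(eye(N)));      % one-step links, plus i -> i in zero steps<br />
 R = A^(N-1) > 0;                            % R(i,j) is true if j is reachable from i<br />
 C = R & R';                                 % C(i,j) is true if i and j communicate<br />
 [~, ~, label] = unique(double(C), 'rows');  % states with identical rows of C share a class<br />
 disp(label')                                % equal labels mean the same class<br />
<br />
States 1 and 2 receive the same label, while states 3 and 4 each receive a label of their own.<br />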
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straight forward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the other will be transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. The chain can be back at <math>\displaystyle 0</math> after <math>\displaystyle 2n</math> steps only if exactly <math>\displaystyle n</math> of those steps go to the right and <math>\displaystyle n</math> go to the left, so a return after an odd number of steps is impossible.<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)=\binom{2n}{n}p^{n}q^{n}</math><br />
We now want to know whether this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since returns are only possible after an even number of steps, the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> (that is, <math>p\neq \frac{1}{2}</math>) the series converges and the state is transient. Otherwise <math>\displaystyle 4pq = 1</math>, the terms decay only like <math>1/\sqrt{n\pi}</math>, the series diverges and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n) \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
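<br />
The two cases can also be seen numerically from the partial sums of <math>P_{00}(2n)</math>. The sketch below is an illustration only: the cutoff M and the two values of p are arbitrary choices. For p = 0.6 the closed form above gives <math>1/\sqrt{1-4pq} = 5</math>, while for p = 0.5 the partial sum keeps growing as M increases.<br />
<br />
 % Sketch: partial sums of P_00(2n) = nchoosek(2n,n) * p^n * q^n (M is an illustrative cutoff)<br />
 M = 2000;<br />
 for p = [0.5 0.6]<br />
     q = 1 - p;<br />
     term = 1;                                     % n = 0 term, P_00(0) = 1<br />
     S = term;<br />
     for n = 1:M<br />
         term = term * (2*n)*(2*n-1)/(n*n) * p*q;  % ratio of consecutive terms<br />
         S = S + term;<br />
     end<br />
     fprintf('p = %.1f: partial sum up to n = %d is %.4f\n', p, M, S);<br />
 end<br />
<br />
The terms are built up from the ratio of consecutive terms so that the large binomial coefficients are never formed explicitly.<br />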
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time needed to go from state i to state j. It is also possible that state j is not reachable from state i, in which case the time is infinite.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
min\{n\geq 1: x_{n}=j\}, & \text{if such an } n \text{ exists (given } x_{0}=i) \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
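A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle d</math> if, starting from a state, the chain can return to that state only at times that are multiples of <math>\displaystyle d</math>. More precisely, the period of state <math>\displaystyle i</math> is the greatest common divisor of all <math>\displaystyle n\geq 1</math> with <math>\displaystyle P_{ii}(n)>0</math>, and the state is periodic when this number is larger than <math>\displaystyle 1</math>.<br />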
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: from any state, the chain can return to that state only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
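<br />
The period of a state can be found numerically as the greatest common divisor of the step counts at which a return has positive probability. The sketch below does this for state 1 of the four-state chain above. It is an illustration only, and the cutoff M on the number of steps examined is an arbitrary choice.<br />
<br />
 % Sketch: period of state i as the gcd of all n <= M with P^n(i,i) > 0 (M is illustrative)<br />
 P = [0 0.5 0 0.5; 0.5 0 0.5 0; 0 0.5 0 0.5; 0.5 0 0.5 0];<br />
 i = 1; M = 20;<br />
 d = 0; Pn = eye(size(P,1));<br />
 for n = 1:M<br />
     Pn = Pn * P;             % Pn now holds P^n<br />
     if Pn(i,i) > 0<br />
         d = gcd(d, n);       % gcd(0,n) = n, so the first return time initializes d<br />
     end<br />
 end<br />
 disp(d)                      % 2 for this chain<br />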
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i > 0 \forall i</math> and <math>\sum_i \pi_i = 1</math>(ie. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.<br />
'''<br><br />
We need to show <math>\pi = \pi P</math><br />
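<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required, where the second equality uses detailed balance and the last one uses the fact that each row of <math>P</math> sums to 1.<br />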
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
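<br />
This behaviour is easy to verify numerically. The short sketch below is an illustration only: it checks that the uniform vector is stationary and that the powers of P cycle instead of settling down.<br />
<br />
 % Sketch: the cyclic chain has a stationary distribution, but P^n does not converge<br />
 P = [0 1 0; 0 0 1; 1 0 0];<br />
 ppi = [1 1 1]/3;<br />
 disp(ppi * P)                % equals ppi, so ppi is stationary<br />
 disp(P^3)                    % the identity matrix: the powers of P repeat with period 3<br />
 disp(P^4)                    % equals P again<br />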
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi</math> = [0.5 0 0.5], <br />
then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
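<br />
The result can be checked numerically by solving <math>\pi=\pi P</math> together with the normalization constraint, and by comparing with a high power of P. This is a sketch for illustration only. The variable name ppi and the power 50 are arbitrary choices.<br />
<br />
 % Sketch: numerical check of the stationary distribution of the example above<br />
 P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];<br />
 N = size(P,1);<br />
 A = [P' - eye(N); ones(1,N)];    % stack pi*(P - I) = 0 with the constraint sum(pi) = 1<br />
 b = [zeros(N,1); 1];<br />
 ppi = (A \ b)';                  % least-squares solution of this consistent system<br />
 disp(ppi)                        % compare with (4/11, 4/11, 3/11)<br />
 disp(P^50)                       % every row is close to ppi<br />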
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int^\ h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>; this is the starting point of the chain. Choose it randomly and set the index <math>\displaystyle i=0</math><br />
:<br />2. Draw <math>Y \sim q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}\cdot\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br />4. Draw <math> U \sim Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update the index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>\displaystyle x</math> of the chain.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This special case is the original algorithm of Metropolis (1953), and it is called the Metropolis Algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The density of setting the mean to x and sampling y is equal to the density of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
 b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
 X=zeros(1,10000); % preallocate the chain<br />
 X(1)=randn; % random starting point<br />
 for i=2:10000<br />
     Y=b*randn+X(i-1); % proposal from N(X(i-1),b^2); we want to decide whether we accept this Y<br />
     r=min( (1+X(i-1)^2)/(1+Y^2),1); % acceptance probability for the Cauchy target<br />
     u=rand;<br />
     if u<r<br />
         X(i)=Y; % accept Y<br />
     else<br />
         X(i)=X(i-1); % reject Y, remaining in the current state<br />
     end;<br />
 end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle y</math> will usually land far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> will be very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b helps avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
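<br />
One simple way to compare the three cases without looking at histograms is to record how often a proposal is accepted. The sketch below reruns the Cauchy sampler from above for the three values of b and prints the empirical acceptance rate. It is an illustration only: the counter name accepted and the chain length n are arbitrary choices.<br />
<br />
 % Sketch: empirical acceptance rate of the Cauchy sampler for several b (b and n are illustrative)<br />
 n = 10000;<br />
 for b = [0.001 2 1000]<br />
     X = zeros(1,n); X(1) = randn;<br />
     accepted = 0;<br />
     for i = 2:n<br />
         Y = b*randn + X(i-1);                  % proposal centered at the current state<br />
         r = min((1+X(i-1)^2)/(1+Y^2), 1);      % acceptance probability<br />
         if rand < r<br />
             X(i) = Y; accepted = accepted + 1;<br />
         else<br />
             X(i) = X(i-1);<br />
         end<br />
     end<br />
     fprintf('b = %g: acceptance rate = %.3f\n', b, accepted/(n-1));<br />
 end<br />
<br />
A very small b gives an acceptance rate close to 1 because the chain barely moves, while a very large b gives a rate close to 0 because almost every proposal is rejected.<br />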
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
So far we have talked about <math>\emph{discrete}</math> Markov chains. <br />
<br />
<br /> We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution, <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
To do so, we start from the right-hand side of the stationarity condition and apply Detailed Balance inside the integral:<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math><br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})\,r(x,y)=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br /><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
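<br />
The same argument can be checked numerically on a small discrete state space, where the Metropolis-Hastings transition matrix can be written out in full. The sketch below is an illustration only: the target pmf f and the non-symmetric proposal matrix Q are made-up values, and the matrix D is a hypothetical helper holding the products f(x)P(x,y).<br />
<br />
 % Sketch: Metropolis-Hastings transition matrix on 4 states and a detailed balance check<br />
 % (the target f and the proposal Q are illustrative choices)<br />
 f = [0.1 0.2 0.3 0.4];                                             % target pmf<br />
 Q = [0.5 0.5 0 0; 0.25 0.5 0.25 0; 0 0.25 0.5 0.25; 0 0 0.5 0.5];  % proposal, rows sum to 1<br />
 N = numel(f);<br />
 P = zeros(N);<br />
 for x = 1:N<br />
     for y = 1:N<br />
         if y ~= x && Q(x,y) > 0<br />
             r = min(f(y)*Q(y,x) / (f(x)*Q(x,y)), 1);<br />
             P(x,y) = Q(x,y) * r;      % propose y, then accept it<br />
         end<br />
     end<br />
     P(x,x) = 1 - sum(P(x,:));         % remaining probability: the chain stays at x<br />
 end<br />
 D = diag(f) * P;                      % D(x,y) = f(x) P(x,y)<br />
 disp(max(max(abs(D - D'))))           % close to 0: detailed balance holds<br />
 disp(max(abs(f*P - f)))               % close to 0: f is stationary<br />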
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999), whose 1953 algorithm also underlies the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. It is a more efficient method, although it is less widely applicable.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability of reaching any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br />2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of the distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br />3. <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall that in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) = N(x,b^2) </math><br />
<br />
In Matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
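<br />
The random walk procedure above can be sketched in Python (the lecture code is MATLAB; Python is used here only for illustration). The target is the standard Cauchy density from the earlier example. The function names, the seed, the chain length and the choice <math>\displaystyle b = 2</math> are assumptions of this sketch, not values from the lecture.<br />
<br />
```python
import math
import random


def cauchy_density(x):
    """Standard Cauchy density, the target f from the earlier example."""
    return 1.0 / (math.pi * (1.0 + x * x))


def random_walk_mh(f, x0, b, n_steps, rng):
    """Random walk Metropolis-Hastings with N(0, b^2) increments."""
    samples = [x0]
    accepted = 0
    for _ in range(n_steps):
        x = samples[-1]
        y = x + b * rng.gauss(0.0, 1.0)  # Y = X_i + epsilon, epsilon ~ N(0, b^2)
        r = min(f(y) / f(x), 1.0)        # q is symmetric, so it cancels in r
        if rng.random() < r:
            samples.append(y)            # accept Y
            accepted += 1
        else:
            samples.append(x)            # reject Y, stay at X_i
    return samples, accepted / n_steps


rng = random.Random(0)  # fixed seed so the run is repeatable
samples, acceptance_rate = random_walk_mh(cauchy_density, 0.0, 2.0, 5000, rng)
print("acceptance rate:", round(acceptance_rate, 3))
```
<br />
Rerunning the sketch with different values of b and watching the printed acceptance rate is one way to apply the rule of thumb above.<br />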
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant (the form of the distribution is known, but the distribution is not exactly known).<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> depends on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability r} \\<br />
x_i, & \text{with probability 1-r} \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
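<br />
A minimal Python sketch of the independence sampler follows. The target (a standard normal known only up to its constant) and the proposal (a standard Cauchy) are assumptions chosen for illustration, as are the function names and the seed. Counting repeated states shows the dependence: whenever a proposal is rejected, the chain repeats its previous value.<br />
<br />
```python
import math
import random


def f_unnormalized(x):
    """Target known up to a constant: standard normal without 1/sqrt(2*pi)."""
    return math.exp(-0.5 * x * x)


def q_density(x):
    """Fixed proposal density q(y): standard Cauchy."""
    return 1.0 / (math.pi * (1.0 + x * x))


def q_draw(rng):
    """Draw from the standard Cauchy by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def independence_mh(x0, n_steps, rng):
    chain = [x0]
    for _ in range(n_steps):
        x = chain[-1]
        y = q_draw(rng)  # the proposal ignores the current state x
        ratio = (f_unnormalized(y) / f_unnormalized(x)) * (q_density(x) / q_density(y))
        r = min(ratio, 1.0)  # r still depends on x
        chain.append(y if rng.random() < r else x)
    return chain


chain = independence_mh(0.0, 5000, random.Random(1))
repeats = sum(1 for a, b in zip(chain, chain[1:]) if a == b)
print("rejected proposals (repeated states):", repeats)
```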
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But, this is the same problem as <math>\displaystyle \max(e^{\frac{-h(x)}{T}})</math> for some constant T (since the exponential function is a monotone function)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T and go to Step 3 (continuing from the current state <math>\displaystyle X_{i+1}</math>)<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, but since T is large the exponent is close to 0 and <math>\displaystyle r </math> is close to 1, so we still accept <math>\displaystyle y </math> with high probability<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore almost always reject <math>\displaystyle y </math><br />
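<br />
The effect of the temperature on an uphill move can be checked numerically. The short Python sketch below is only an illustration; the function name and the values <math>\displaystyle h(x) = 1</math>, <math>\displaystyle h(y) = 2</math> are assumed for the example.<br />
<br />
```python
import math


def acceptance_probability(h_x, h_y, temperature):
    """r = min{exp((h(x) - h(y)) / T), 1} for a symmetric proposal."""
    if h_y <= h_x:
        return 1.0  # downhill (or equal) moves are always accepted
    return math.exp((h_x - h_y) / temperature)


# An uphill move with h(x) = 1 and h(y) = 2 at several temperatures.
for temperature in (100.0, 1.0, 0.01):
    print(temperature, acceptance_probability(1.0, 2.0, temperature))
```
<br />
At T = 100 the uphill move is accepted almost always, while at T = 0.01 it is essentially never accepted.<br />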
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, we get T to be pretty small, our distribution that we're sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
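 % obj.m: the objective function to minimize (save it in its own file)<br />
 function v = obj(x)<br />
 v = (x - 2).^2;<br />
<br />
 T = 10; %this is the initial value of T, which we must gradually decrease<br />
 b = 2; %standard deviation of the proposal N(x, b^2)<br />
 X(1) = 0;<br />
 for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
     Y = b*randn + X(i-1); %propose a point near the current state<br />
     %f(Y)/f(X) = exp((h(X)-h(Y))/T); a single exponential avoids 0/0 when T is small<br />
     r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) );<br />
     U = rand;<br />
     if U < r<br />
         X(i) = Y; %accept Y<br />
     else<br />
         X(i) = X(i-1); %reject Y<br />
     end;<br />
 end;<br />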
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
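<br />
The whole cooling schedule can also be run in a single loop. The following Python sketch mirrors the MATLAB example (same objective, same proposal with <math>\displaystyle b = 2</math>, 100 steps per temperature and the temperatures listed above); the function names and the seed are assumptions of the sketch.<br />
<br />
```python
import math
import random


def h(x):
    """Objective to minimize; its minimum is at x = 2."""
    return (x - 2.0) ** 2


def anneal(objective, x0, b, temperatures, steps_per_temperature, rng):
    x = x0
    path = [x]
    for temperature in temperatures:          # T is decreased stage by stage
        for _ in range(steps_per_temperature):
            y = x + b * rng.gauss(0.0, 1.0)   # Y ~ N(x, b^2)
            if objective(y) <= objective(x):
                r = 1.0
            else:
                r = math.exp((objective(x) - objective(y)) / temperature)
            if rng.random() < r:
                x = y                         # accept Y
            path.append(x)
    return path


schedule = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
path = anneal(h, 0.0, 2.0, schedule, 100, random.Random(0))
print("final state:", path[-1])
```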
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the proof given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are strong correlations between the random variables, because the chain then moves very slowly.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note:</b> This method usually takes a long time to converge; this initial period is called the "burn-in" (or "burning") time. For this reason, the distribution is sampled better by keeping only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
 Y = [1 ; 2]; %mean vector mu<br />
 rho = 0.01;<br />
 sigma = sqrt(1 - rho^2); %conditional standard deviation<br />
 X(1,:) = [0 0];<br />
<br />
 for i = 2:5000<br />
     mu = Y(1) + rho*(X(i-1,2) - Y(2)); %conditional mean of x1 given the previous x2<br />
     X(i,1) = mu + sigma*randn;<br />
     mu = Y(2) + rho*(X(i,1) - Y(1)); %conditional mean of x2 given the new x1<br />
     X(i,2) = mu + sigma*randn;<br />
 end;<br />
 %plot(X(:,1),X(:,2),'.') plots all of the points<br />
 %plot(X(1000:end,1),X(1000:end,2),'.') plots the points after the first 999 -><br />
 %this demonstrates that convergence occurs after a while<br />
 %(this is called the burning time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
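<br />
As a rough numerical check of the same sampler, the Python sketch below discards a burn-in period and compares the sample means and correlation with the target values. The value <math>\displaystyle \rho = 0.5</math>, the burn-in length of 1000, the seed and the function names are assumptions made for this illustration.<br />
<br />
```python
import math
import random


def gibbs_bivariate_normal(mu, rho, n_samples, rng):
    """Gibbs sampler for a bivariate normal with unit variances and correlation rho."""
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    x1, x2 = 0.0, 0.0
    draws = []
    for _ in range(n_samples):
        x1 = rng.gauss(mu[0] + rho * (x2 - mu[1]), sd)  # x1 | x2
        x2 = rng.gauss(mu[1] + rho * (x1 - mu[0]), sd)  # x2 | new x1
        draws.append((x1, x2))
    return draws


def summarize(points):
    """Return the two sample means and the sample correlation."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points)
    s11 = sum((p[0] - m1) ** 2 for p in points)
    s22 = sum((p[1] - m2) ** 2 for p in points)
    return m1, m2, s12 / math.sqrt(s11 * s22)


draws = gibbs_bivariate_normal((1.0, 2.0), 0.5, 5000, random.Random(0))
kept = draws[1000:]  # discard the burn-in period
mean1, mean2, corr = summarize(kept)
print(round(mean1, 2), round(mean2, 2), round(corr, 2))
```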
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some random variables <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
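<br />
The procedure above can be sketched in Python as follows. The target (an unnormalized bivariate normal with correlation 0.5), the symmetric random walk proposals used for both <math>\displaystyle q </math> and <math>\displaystyle \tilde{q} </math>, the step size, the seed and the function names are all assumptions of this sketch. Because the proposals are symmetric, the proposal ratios in r equal 1 and are omitted.<br />
<br />
```python
import math
import random

RHO = 0.5  # correlation of the hypothetical target


def log_f(x, y):
    """Log of the unnormalized joint density f(x, y)."""
    return -(x * x - 2.0 * RHO * x * y + y * y) / (2.0 * (1.0 - RHO * RHO))


def accept(log_ratio, rng):
    """Accept with probability min{1, exp(log_ratio)}."""
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)


def mh_within_gibbs(n_steps, step, rng):
    x, y = 0.0, 0.0
    chain = [(x, y)]
    for _ in range(n_steps):
        z = x + step * rng.gauss(0.0, 1.0)  # propose a new X, Y held fixed
        if accept(log_f(z, y) - log_f(x, y), rng):
            x = z
        z = y + step * rng.gauss(0.0, 1.0)  # propose a new Y, using the updated X
        if accept(log_f(x, z) - log_f(x, y), rng):
            y = z
        chain.append((x, y))
    return chain


chain = mh_within_gibbs(6000, 1.0, random.Random(0))
kept = chain[1000:]  # discard an assumed burn-in period
mean_x = sum(p[0] for p in kept) / len(kept)
mean_y = sum(p[1] for p in kept) / len(kept)
print(round(mean_x, 2), round(mean_y, 2))
```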
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N</math> X <math>\displaystyle 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, only its scale needs to be fixed. Summing the rank equation over all pages (each column of <math>\displaystyle L \times D^{-1}</math> sums to 1 when every page has at least one outgoing link) shows that the ranks satisfy<br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1. In matrix form this is <math>\displaystyle e^T \times P = N</math>, so <math>\displaystyle \frac{e^T \times P}{N} = 1</math> and we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times \frac{e^T \times P}{N} + d \times L \times D^{-1} \times P \text{ by replacing 1 with } \frac{e^T \times P}{N} </math> <br />
<br />
<math>\displaystyle P = [\frac{(1-d)}{N} \times e \times e^T + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n we have <math>\displaystyle P_n \approx P</math>.<br />
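<br />
The repeated multiplication can be sketched in Python for a small, made-up web of four pages. This sketch assumes the normalization in which the ranks average to 1 (they sum to N), so that <math>\displaystyle A = \frac{(1-d)}{N} e e^T + d L D^{-1}</math> has columns summing to 1. The link matrix, the value <math>\displaystyle d = 0.85</math>, the iteration count and the function name are assumptions made for illustration.<br />
<br />
```python
def pagerank(links, d=0.85, n_iter=200):
    """links[i][j] = 1 if page j links to page i.

    Assumes every page has at least one outgoing link, so each c_j > 0.
    """
    n = len(links)
    c = [sum(links[i][j] for i in range(n)) for j in range(n)]  # outgoing links of j
    a = [[(1.0 - d) / n + d * links[i][j] / c[j] for j in range(n)]
         for i in range(n)]
    p = [1.0] * n  # initial guess with sum N
    for _ in range(n_iter):
        p = [sum(a[i][j] * p[j] for j in range(n)) for i in range(n)]  # P <- A P
    return a, p


links = [
    [0, 0, 1, 0],  # page 0 is linked from page 2
    [1, 0, 0, 0],  # page 1 is linked from page 0
    [1, 1, 0, 1],  # page 2 is linked from pages 0, 1 and 3
    [0, 0, 0, 0],  # page 3 has no incoming links
]
A, ranks = pagerank(links)
print([round(r, 3) for r in ranks])
```
<br />
Page 3 has no incoming links, so its rank is the constant term <math>\displaystyle 1 - d</math> alone, which illustrates that <math>\displaystyle P_i</math> is never zero.<br />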
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> is given by <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
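<br />
These definitions can be checked on a small example. The Python sketch below uses the vectors <math>\vec{u} = (1, 0)</math> and <math>\vec{v} = (1, 1)</math>, which are chosen here only for illustration; the helper names are likewise assumptions.<br />
<br />
```python
import math


def dot(u, v):
    """Inner product u . v."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Length of v."""
    return math.sqrt(dot(v, v))


def dist(u, v):
    """Distance between u and v, i.e. the length of u - v."""
    return norm([a - b for a, b in zip(u, v)])


def angle(u, v):
    """Angle in radians, from u . v = |u| |v| cos(theta)."""
    return math.acos(dot(u, v) / (norm(u) * norm(v)))


u = [1.0, 0.0]
v = [1.0, 1.0]
print(dot(u, v), norm(v), dist(u, v), math.degrees(angle(u, v)))
```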
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that "orthogonal matrix" is another name for an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that an alternate definition for the trace is:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Alternatively, a matrix is non-singular if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A matrix is said to be ''non-singular'' if and only if its rank equals the number of rows or columns. A non-singular matrix has a non-zero determinant.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>,then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), <math>A^{\dagger} = (A^TA)^{-1}A^T</math> with <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{NxN}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> (the corresponding ''eigenvalue'') such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
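<br />
For a 2 by 2 matrix the characteristic equation is the quadratic <math>\lambda^2 - tr(A)\lambda + |A| = 0</math>, which can be solved directly. The Python sketch below does this for the symmetric matrix with rows (2, 1) and (1, 2), an example chosen here for illustration, and checks the properties listed above. The function name is an assumption.<br />
<br />
```python
import math


def eigenvalues_2x2_symmetric(a, b, c):
    """Eigenvalues of [[a, b], [b, c]] from lambda^2 - tr*lambda + det = 0."""
    trace = a + c
    det = a * c - b * b
    root = math.sqrt(trace * trace - 4.0 * det)  # equals sqrt((a - c)^2 + 4 b^2)
    return (trace - root) / 2.0, (trace + root) / 2.0


A = [[2.0, 1.0], [1.0, 2.0]]
lam_small, lam_large = eigenvalues_2x2_symmetric(2.0, 1.0, 2.0)

# Eigenvectors for this particular matrix: A v = lambda v can be verified by hand.
v_small = (1.0, -1.0)  # belongs to lam_small
v_large = (1.0, 1.0)   # belongs to lam_large

print(lam_small, lam_large)
```
<br />
Both eigenvalues are real and positive, their sum equals the trace, and the two eigenvectors are orthogonal, as expected for a real, symmetric, positive-definite matrix.<br />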
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
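<br />
A rotation of the plane is a simple instance of an orthonormal transformation. The Python sketch below, with an angle and a test vector chosen only for illustration, checks that the rows are orthonormal and that the length of the vector is preserved.<br />
<br />
```python
import math


def rotation(theta):
    """2 x 2 rotation matrix by angle theta (radians)."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]


def matvec(a, x):
    """Matrix-vector product y = A x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(x):
    return math.sqrt(dot(x, x))


A = rotation(math.pi / 6)
x = [3.0, 4.0]
y = matvec(A, x)
print(norm(x), norm(y))       # the two lengths agree up to rounding
print(dot(A[0], A[1]))        # rows are orthogonal (0 up to rounding)
```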
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
Given a high-dimensional sample of vectors, applying PCA produces an orthogonal set of vectors (called principal components) such that the first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
If we ignore the last few principal components (directions with the smallest variance) then we can approximate the data by a lower-dimensional subspace, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA; others use similar but more sophisticated techniques that are outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]; PCA itself is linear).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of PCA===<br />
We want to find the direction of maximum variation. Take a direction <math>w = [w_1, \ldots, w_D]^T</math> and a data point <math>x = [x_1, \ldots, x_D]^T </math>, then compute the length of the projection of the point onto that direction.<br />
<br />
<math><br />
u = \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
</math><br />
<br />
Of course, the direction <math>\textbf{w}</math> is the same as <math>2\textbf{w}</math> or in general <math>c\textbf{w}</math>, and it doesn't matter which one we use. So without loss of generality, let the length of <math>\textbf{w}</math> be 1. Therefore <math>\textbf{w}^T \textbf{w} = 1</math> so the equation simplifies to just<br />
<br />
<math><br />
u = \textbf{w}^T \textbf{x}.<br />
</math><br />
<br />
Let <math>x_1, \ldots, x_D</math> be random variables; then our goal is to maximize the variance of <math>u</math>, which is<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br />
where <math>\Sigma</math> is the covariance matrix. For a finite data set we can replace <math>\Sigma</math> by <math>s</math>, the sample covariance matrix. <br />
<br />
So <math>\displaystyle w^T sw </math> is the sample variance of the projection <math>\displaystyle u </math> obtained with the weight vector <math>\displaystyle w </math>.<br />
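<br />
The identity <math>\textrm{var}(u) = \textbf{w}^T s \textbf{w}</math> can be checked numerically. The following is a minimal Python sketch, not code from the lecture; the data points, the direction and the helper names (<code>sample_covariance</code>, <code>sample_variance</code>, <code>quadratic_form</code>) are made up for illustration.<br />

```python
import math

def sample_covariance(points):
    """Sample covariance matrix (divisor n - 1) of a list of D-dimensional points."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / (n - 1)
             for b in range(dim)] for a in range(dim)]

def sample_variance(values):
    """Sample variance (divisor n - 1) of a list of numbers."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

def quadratic_form(w, s):
    """Compute w^T s w."""
    return sum(w[a] * s[a][b] * w[b] for a in range(len(w)) for b in range(len(w)))

# Hypothetical 2-dimensional data set (not the assignment data).
points = [(2.0, 1.0), (3.0, 4.0), (5.0, 4.5), (7.0, 8.0), (8.0, 7.5)]
w = (1 / math.sqrt(2), 1 / math.sqrt(2))      # unit-length direction
u = [w[0] * x + w[1] * y for x, y in points]  # projections u = w^T x

s = sample_covariance(points)
print(sample_variance(u), quadratic_form(w, s))  # the two numbers agree
```

Both printed values are the variance of the projected data, computed in two different ways.<br />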
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br />
Therefore the variance is bounded, so the maximum exists. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, but subject to a constraint. The problem then becomes:<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br />
<br />
Next lecture we will actually find the maximum.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a Lagrange multiplier and form the Lagrangian L:<br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of L (partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which the level curves of <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel (equivalently, the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel).<br />
<br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where <math>\displaystyle \nabla_{x,y} f = \left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)</math><br />
<br><br />
and <math>\displaystyle \nabla_{x,y} g = \left(\frac{\partial g}{\partial x},\frac{\partial g}{\partial y}\right)</math><br />
<br><br />
To incorporate these into one equation, we define L as <math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math>.<br />
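<br />
As a small worked example (made up for illustration, not taken from the lecture), maximize <math>\displaystyle f(x,y) = x + y</math> subject to <math>\displaystyle g(x,y) = x^2 + y^2 - 1 = 0</math>. The Lagrangian and its stationarity conditions are<br />
<br />
<math>\begin{align} L(x,y,\lambda) &{}= x + y - \lambda (x^2 + y^2 - 1) \\ \frac{\partial L}{\partial x} &{}= 1 - 2\lambda x = 0 \\ \frac{\partial L}{\partial y} &{}= 1 - 2\lambda y = 0 \\ \frac{\partial L}{\partial \lambda} &{}= -(x^2 + y^2 - 1) = 0 \end{align}</math><br />
<br />
The first two conditions give <math>\displaystyle x = y = \frac{1}{2\lambda}</math>, and substituting into the constraint gives <math>\displaystyle \frac{1}{2\lambda^2} = 1</math>, so <math>\displaystyle \lambda = \pm \frac{1}{\sqrt{2}}</math>. Taking <math>\displaystyle \lambda = \frac{1}{\sqrt{2}}</math> gives the maximum <math>\displaystyle f = \sqrt{2}</math> at <math>\displaystyle x = y = \frac{1}{\sqrt{2}}</math>; taking <math>\displaystyle \lambda = -\frac{1}{\sqrt{2}}</math> gives the minimum <math>\displaystyle f = -\sqrt{2}</math>.<br />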
<br />
<br><br />
Back to the original problem, from the Lagrangian we obtain <math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T s \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T s \textbf{w} </math> can be thought of as a quadratic function in '''w''', hence the '''2sw''' below)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2s\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br />
<math>\displaystyle s\textbf{w} = \lambda\textbf{w} </math><br />
<br><br />
This equation means that <math>\textbf{w}</math> is an eigenvector of s and <math>\lambda</math> is an eigenvalue of s.<br />
<br><br />
If we substitute <math>\displaystyle s\textbf{w} = \lambda\textbf{w}</math> into <math>\displaystyle \textbf{w}^T s\textbf{w}</math> we obtain <math>\displaystyle\textbf{w}^T s\textbf{w} = \textbf{w}^T \lambda \textbf{w} = \lambda \textbf{w}^T \textbf{w} = \lambda </math><br />
<br><br />
In order to maximize the objective function we need to choose the eigenvector with the largest eigenvalue.<br />
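<br />
To make this concrete, the sketch below finds the eigenvector with the largest eigenvalue for a small covariance matrix. It is only an illustration in Python: the matrix is made up, and power iteration is a technique chosen here for brevity rather than the method used in class (the class example uses Matlab's <code>svd</code>).<br />

```python
import math

def mat_vec(s, w):
    """Matrix-vector product s w for a square matrix stored as nested lists."""
    return [sum(s[i][j] * w[j] for j in range(len(w))) for i in range(len(s))]

def first_principal_component(s, iterations=200):
    """Power iteration: repeatedly apply s and rescale to unit length."""
    w = [1.0] + [0.0] * (len(s) - 1)  # arbitrary starting direction
    for _ in range(iterations):
        v = mat_vec(s, w)
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    sw = mat_vec(s, w)
    variance = sum(wi * swi for wi, swi in zip(w, sw))  # w^T s w, equal to lambda
    return w, variance

s = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical covariance matrix
w, lam = first_principal_component(s)
print(w, lam)
```

For this matrix the eigenvalues are 3 and 1, so the iteration settles on the direction proportional to (1, 1) and the variance along it is 3.<br />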
<br />
We choose the first PC, '''u1''', to have the maximum variance (i.e. to capture as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible).<br />
<br />
Subsequent principal components take up successively smaller parts of the total variability.<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(s) = \sum_{i = 1}^D Var(x_i)</math><br />
<br />
i.e. the sum of the variances along all principal directions equals the total variance of the data.<br />
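<br />
For example, take the made-up covariance matrix (not one from the lecture)<br />
<br />
<math>s = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}</math><br />
<br />
Its eigenvalues solve <math>\displaystyle (2 - \lambda)^2 - 1 = 0</math>, so <math>\displaystyle \lambda_1 = 3</math> and <math>\displaystyle \lambda_2 = 1</math>, with unit eigenvectors <math>\displaystyle \frac{1}{\sqrt{2}}[1, 1]^T</math> and <math>\displaystyle \frac{1}{\sqrt{2}}[1, -1]^T</math>. Then <math>\displaystyle \lambda_1 + \lambda_2 = 4 = Tr(s) = Var(x_1) + Var(x_2) = 2 + 2</math>, and the first principal component accounts for <math>\displaystyle 3/4</math> of the total variance.<br />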
<br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
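 % Load the noisy face images; each column of X is one image stored as a vector<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
 % Display the first noisy image in grayscale<br />
 imagesc(reshape(X(:,1),20,28)')<br />
 colormap gray<br />
 % Singular value decomposition X = u*s*v'<br />
 [u s v] = svd(X);<br />
 % Rank-10 reconstruction from the top 10 singular vectors<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';<br />
 % Display a reconstructed (de-noised) image in a new figure<br />
 figure<br />
 imagesc(reshape(xHat(:,1000),20,28)')<br />
 colormap gray<br />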
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
<b> I can't seem to save more images on my nexus account, so could someone run the code above in matlab and plot the images?</b><br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
Main Contribution not complete<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (images etc). There are generally two ways to do this. We can classify the data (i.e. give each data point a label and compare the different types of data) or cluster the data (i.e. leave the data unlabeled and look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 by 8 picture, we can use all 64 pixels). However, working in the full dimension is hard, and it is easier to use the reduced data and its features.<br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s from the same data set we have been using throughout the course. We will apply PCA to the data and compare the images of the labeled data. This is an example of classification.<br />
<br />
MATLAB CODE<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
1. The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
2. Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the <br />
inner product. <math>U^T *X </math> is a (d x n) matrix.<br />
3. When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the <br />
dimensions that have lower variance (e.g. noise)<br />
4. We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
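<br />
The notes above can be sketched on a toy data set. This is a minimal Python illustration under stated assumptions, not the algorithm from the slide: the data are made up, <code>top_direction</code> is a hypothetical helper that uses power iteration, and only one direction (d = 1) is kept.<br />

```python
import math

def top_direction(s, iterations=500):
    """Leading unit eigenvector of a symmetric matrix via power iteration."""
    w = [1.0] * len(s)
    for _ in range(iterations):
        v = [sum(s[i][j] * w[j] for j in range(len(w))) for i in range(len(s))]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    return w

# Made-up data: five points lying exactly on the line y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
n = len(data)

# Note 1: subtract the mean so that the centred data have mean 0.
mean = [sum(p[j] for p in data) / n for j in range(2)]
centred = [[p[j] - mean[j] for j in range(2)] for p in data]

# Sample covariance matrix of the centred data.
s = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(2)]
     for a in range(2)]
u = top_direction(s)  # plays the role of U with d = 1

# Note 2: encode by taking inner products with u (one coefficient per point).
codes = [sum(u[j] * c[j] for j in range(2)) for c in centred]

# Note 3: reconstruct from the top direction only and add the mean back.
reconstructed = [[mean[j] + y * u[j] for j in range(2)] for y in codes]
print(reconstructed)
```

Because these points lie on a single line, one component captures all of the variance and the reconstruction reproduces the data; with noisy data the discarded directions would carry the (small) remaining variance.<br />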
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
Main Contribution not complete<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (in this case, a direction representative of a particular characteristic, e.g. glasses vs. no-glasses).<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes. For a 2 class problem, we want to reduce the data to one dimension (a line). Generally, for a k class problem, we want k-1 dimensions.<br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, while the points within each class are close together (ideally collapsing to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
<b> Goal </b><br />
<br />
<b>1. Minimize the within class variance</b><br />
<br />
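<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math>, where <math>\displaystyle \Sigma_1</math> and <math>\displaystyle \Sigma_2</math> are the covariance matrices of the two classes.<br />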
<br />
<br />
<b>2. Maximize the distance between the class means after projection</b><br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
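<br />
The identity derived above can be checked numerically. The following Python sketch uses made-up class means and an arbitrary direction; the name <code>s_b</code> for the matrix <math>(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> is introduced here only for illustration.<br />

```python
mu1 = [1.0, 2.0, 0.5]   # made-up mean of class 1
mu2 = [3.0, -1.0, 2.0]  # made-up mean of class 2
w = [0.2, -0.4, 0.7]    # arbitrary projection direction

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

diff = [a - b for a, b in zip(mu1, mu2)]

# Left-hand side: squared distance between the projected means.
lhs = (dot(w, mu1) - dot(w, mu2)) ** 2

# Right-hand side: w^T s_b w with s_b = (mu1 - mu2)(mu1 - mu2)^T.
s_b = [[di * dj for dj in diff] for di in diff]
rhs = sum(w[i] * s_b[i][j] * w[j] for i in range(3) for j in range(3))
print(lhs, rhs)  # the two values agree
```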
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math></div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Face1.jpg&diff=3224File:Face1.jpg2009-07-18T20:51:46Z<p>Hclam: noisy face</p>
<hr />
<div>noisy face</div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3223stat341 / CM 3612009-07-18T20:42:37Z<p>Hclam: /* Lagrange Multiplier */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges, because a computer follows deterministic instructions. Computational statistics therefore relies on deterministic algorithms whose output appears to be uniformly distributed (pseudo-random numbers). Outside a computational setting, producing a uniform distribution is fairly easy (for example, rolling a fair die repeatedly to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
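<br />
The recurrence is easy to experiment with. The following Python sketch (the function names <code>lcg</code> and <code>period</code> are made up for this illustration) reproduces the values computed above and counts how many steps it takes for the sequence to return to the seed.<br />

```python
def lcg(a, b, m, seed):
    """Yield x_1, x_2, ... from x_{i+1} = (a * x_i + b) mod m."""
    x = seed
    while True:
        x = (a * x + b) % m
        yield x

def period(a, b, m, seed):
    """Number of steps until the sequence first returns to the seed.

    Assumes the seed lies on a cycle, which holds for this example.
    """
    steps = 0
    for x in lcg(a, b, m, seed):
        steps += 1
        if x == seed:
            return steps

gen = lcg(a=13, b=0, m=31, seed=1)
first_three = [next(gen) for _ in range(3)]
print(first_three)           # [13, 14, 27], as computed above
print(period(13, 0, 31, 1))  # 30: each of 1, ..., 30 appears once per cycle
```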
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way, and those that do not are unsuitable for generating pseudo random numbers.<br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
Ideally, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for any ''n'', and the period to be equal to ''m'' - 1. In practice, Park and Miller found in a paper published in 1988 that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the largest value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
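<br />
A minimal Python sketch of the generator with these parameters is shown below. The function name <code>park_miller</code> is made up, and dividing by ''m'' (rather than ''m'' - 1) is a choice made here so that the scaled values stay strictly between 0 and 1.<br />

```python
A = 7 ** 5       # 16807
M = 2 ** 31 - 1  # 2147483647, a prime

def park_miller(seed, count):
    """Return `count` successive values of x_{i+1} = A * x_i mod M."""
    values = []
    x = seed
    for _ in range(count):
        x = (A * x) % M
        values.append(x)
    return values

values = park_miller(seed=1, count=10000)
uniforms = [x / M for x in values]  # scaled into the open interval (0, 1)
print(values[:3])                   # [16807, 282475249, 1622650073]
```

Python integers do not overflow, so the product <code>A * x</code> can be formed directly; in a language with 32-bit integers the multiplication would need a wider type.<br />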
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is passed through the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) of some distribution, then the random variable X has cdf F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is monotonically increasing (non-decreasing in general), since the probability that X is at most a larger number is at least the probability that X is at most a smaller number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
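<br />
The two steps can be sketched in Python as follows. The target used here, <math>F(x) = x^2</math> on [0, 1] with <math>F^{-1}(u) = \sqrt{u}</math>, is an assumption chosen only because its inverse is simple; the function name <code>inverse_transform_sample</code> is likewise made up.<br />

```python
import math
import random

def inverse_transform_sample(inverse_cdf, n, rng):
    """Step 1: draw U ~ Unif[0, 1]; Step 2: return X = F^{-1}(U)."""
    return [inverse_cdf(rng.random()) for _ in range(n)]

# Illustrative target: F(x) = x^2 on [0, 1], so F^{-1}(u) = sqrt(u).
rng = random.Random(0)
xs = inverse_transform_sample(math.sqrt, 20000, rng)

# Compare the empirical cdf with F at a few points.
for x0 in (0.25, 0.5, 0.75):
    empirical = sum(x <= x0 for x in xs) / len(xs)
    print(x0, empirical, x0 ** 2)
```

The empirical proportions should be close to <math>F(x_0) = x_0^2</math>, which is what the theorem predicts.<br />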
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
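:<math> F^{-1}(y) = \frac{-\log(1-y)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also distributed as Unif[0, 1] whenever <math>U</math> is, we can equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1, \dots, x_n</math> from <math>f(x)</math> by:<br />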
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows an exponential pattern, as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math>. Further, for some distributions it is not even possible to write <math>F(x)</math> in closed form, and even when a closed-form expression exists, <math>F^{-1}</math> is usually difficult to find.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the less-than signs from class have been changed to less-than-or-equal-to signs by me, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000); % preallocate the sample vector<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2)) % cumulative probability P(X <= 1) = 0.5<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c \ |\ c \cdot g(x) \geq f(x)\ \forall x</math>, we can draw candidate samples from <math>g(x)</math> and accept each candidate <math>x</math> with probability <math> \frac {f(x)}{c \cdot g(x)} </math>; the accepted candidates form a sample that follows the target distribution <math>f(x)</math>. Candidates at which this ratio is close to 1 are almost always accepted, while candidates at which the ratio is small are mostly rejected.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (scaled proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is close to 1, so a candidate proposed there is almost always accepted.<br />
<br />
At x=9, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is smaller, so a candidate proposed there is rejected more often.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)</math><br />
<br />
So,<br />
<math>\begin{align}<br />
Pr(accept) &{}= \int_x Pr(accept|X) \times Pr(X) dx \\<br />
&{}= \int_x \frac{f(x)}{c \cdot g(x)} g(x) dx \\<br />
&{}= \frac{1}{c} \int_x f(x) dx \\<br />
&{}= \frac{1}{c}<br />
\end{align}</math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
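 n = 1000; % number of samples to accept<br />
 x = zeros(1,n);<br />
 j = 0; % number of accepted samples so far<br />
 while j < n<br />
   y = rand; % proposal Y ~ Unif[0,1]<br />
   u = rand;<br />
   if u <= y % accept with probability f(y)/(c*g(y)) = 2y/2 = y<br />
     j = j + 1;<br />
     x(j) = y;<br />
   end<br />
 end<br />
 hist(x);<br />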
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
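<br />
The steps of this example can be sketched in Python as follows (an illustration only; the function name <code>sample</code> and the bookkeeping of the number of proposals are additions made here). The sketch also estimates the acceptance rate, which by the proof above should be close to <math>1/c</math>.<br />

```python
import random

pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}  # target distribution f
g = 0.25                                     # discrete uniform proposal on {1,2,3,4}
c = max(p / g for p in pmf.values())         # 0.55 / 0.25 = 2.2

def sample(rng):
    """Return (x, number_of_proposals) for one accepted draw."""
    proposals = 0
    while True:
        proposals += 1
        y = rng.choice([1, 2, 3, 4])  # step 1: Y ~ Unif{1,2,3,4}
        u = rng.random()              # step 2: U ~ Unif[0,1]
        if u <= pmf[y] / (c * g):     # step 3: accept with probability f(Y)/(c g(Y))
            return y, proposals

rng = random.Random(1)
draws = [sample(rng) for _ in range(20000)]
freq = {k: sum(x == k for x, _ in draws) / len(draws) for k in pmf}
acceptance_rate = len(draws) / sum(n for _, n in draws)
print(freq, acceptance_rate)  # freq should be near pmf; rate near 1/c
```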
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math>, i.e. mean <math> 1/\lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
It appears that if we want to sample from the Gamma distribution, we can consider sampling from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and adding them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far, e.g. the inverse is not easy to obtain from F(Z); we may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point on the Cartesian plane, r is the distance from that point to the pole (origin), and θ is the angle formed with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of <math>d = r^2</math> and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, Exponential and Uniform, so d and <math>\theta</math> are independent<br />
<math> \begin{matrix} \Rightarrow d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math>, where Exp(1/2) denotes rate 1/2 (mean 2).<br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
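As a cross-check of the three-step procedure, here is a minimal Python sketch using only the standard library. The function name <code>box_muller</code>, the seed, and the sample size are assumptions chosen for illustration.<br />

```python
import math
import random


def box_muller(rng):
    """Return one pair (x, y) of independent N(0,1) draws via the Box-Muller transform."""
    u1 = 1.0 - rng.random()          # in (0, 1], avoids log(0)
    u2 = rng.random()
    d = -2.0 * math.log(u1)          # d = r^2 is exponential with rate 1/2
    theta = 2.0 * math.pi * u2       # theta is uniform on [0, 2*pi)
    r = math.sqrt(d)
    return r * math.cos(theta), r * math.sin(theta)


rng = random.Random(1)  # fixed seed for reproducibility
pairs = [box_muller(rng) for _ in range(20000)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]
n = len(pairs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
print(mean_x, var_x, cov_xy)  # expect roughly 0, 1 and 0
```

A sample covariance near zero is consistent with, but does not prove, the independence of x and y.<br />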
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \text{ independently} \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (the variable u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
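The third property can also be sketched in plain Python for the 2×2 case. The helper names <code>cholesky_2x2</code> and <code>sample_mvn_2d</code> are hypothetical. The sketch uses the lower-triangular factor L with <math>LL^T=\Sigma</math>, which is the transpose of the upper-triangular factor that Matlab's chol returns.<br />

```python
import math
import random


def cholesky_2x2(s):
    """Lower-triangular L with L * L^T = s, for a symmetric positive definite 2x2 matrix s."""
    l11 = math.sqrt(s[0][0])
    l21 = s[1][0] / l11
    l22 = math.sqrt(s[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]


def sample_mvn_2d(mu, s, n, rng):
    """Draw n points from N(mu, s) as mu + L z, with z a pair of independent N(0,1) draws."""
    L = cholesky_2x2(s)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mu[0] + L[0][0] * z1,
                       mu[1] + L[1][0] * z1 + L[1][1] * z2))
    return points


rng = random.Random(2)  # fixed seed for reproducibility
sigma = [[1.0, 0.5], [0.5, 1.0]]
pts = sample_mvn_2d([-2.0, 3.0], sigma, 15000, rng)
m1 = sum(p[0] for p in pts) / len(pts)
m2 = sum(p[1] for p in pts) / len(pts)
cov12 = sum((p[0] - m1) * (p[1] - m2) for p in pts) / (len(pts) - 1)
print(m1, m2, cov12)  # expect roughly -2, 3 and 0.5
```

Any matrix A with <math>AA^T=\Sigma</math> works in place of L, which is why the Cholesky factor and the symmetric square root give samples with the same distribution.<br />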
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N \left[ -\frac{1}{2}\log (2\pi) - \log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 \right]</math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming <math>\sigma</math> is known, so that <math>\theta = \mu</math>, and that the prior on <math>\mu</math> is Gaussian:<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \left[\prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}\right] * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
:<math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean <math>\tilde{\mu}</math> is a weighted average of the sample mean <math>\bar{x}</math> (the MLE) and the prior mean <math>\mu_0</math>, so in general it differs from the MLE.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
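The two limiting cases can be checked numerically. The sketch below is a small Python illustration; the function name <code>posterior_mean_var</code> and the numbers passed to it are hypothetical, and the posterior variance <math>(N/\sigma^2 + 1/\tau^2)^{-1}</math> used here is the standard conjugate-normal result with <math>\sigma</math> assumed known.<br />

```python
def posterior_mean_var(xbar, n, sigma, mu0, tau):
    """Posterior mean and variance of mu for a N(mu, sigma^2) likelihood and N(mu0, tau^2) prior."""
    prec_data = n / sigma ** 2        # precision contributed by the n observations
    prec_prior = 1.0 / tau ** 2       # precision of the prior
    post_var = 1.0 / (prec_data + prec_prior)
    post_mean = post_var * (prec_data * xbar + prec_prior * mu0)
    return post_mean, post_var


# Hypothetical numbers: sample mean 2.0, known sigma = 1, prior N(0, 1)
for n in (0, 4, 10000):
    print(n, posterior_mean_var(2.0, n, 1.0, 0.0, 1.0))
```

With no data the posterior mean equals the prior mean, and as N grows it approaches the sample mean, in line with the two notes above.<br />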
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to simulating a random process, i.e. one whose future evolution is described by probability distributions (the counterpart of a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
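The process, together with the standard error and a confidence interval, can be sketched in Python with the standard library only. The function name <code>mc_integrate</code> and the test integrand <math>h(x)=x^3</math> on (0,1), whose exact integral is 1/4, are choices made for this illustration; 1.96 is the usual normal quantile for a 95% interval.<br />

```python
import math
import random


def mc_integrate(h, a, b, n, rng):
    """Estimate the integral of h over (a, b) with n uniform draws; return (estimate, standard error)."""
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]   # Y_i = w(x_i)
    i_hat = sum(ys) / n
    s2 = sum((y - i_hat) ** 2 for y in ys) / (n - 1)          # sample variance of the Y_i
    return i_hat, math.sqrt(s2 / n)


rng = random.Random(3)  # fixed seed for reproducibility
i_hat, se = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 10000, rng)
ci = (i_hat - 1.96 * se, i_hat + 1.96 * se)   # approximate 95% confidence interval
print(i_hat, se, ci)
```

The standard error shrinks like <math>1/\sqrt{N}</math>, so halving it requires about four times as many draws.<br />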
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, <math>x \geq 0</math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
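u = rand(100,1);<br />
x = -log(u); % inverse-transform draws from Exp(1)<br />
h = x.^.5; % h(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value obtained by numerical integration in Matlab (the exact value is sqrt(pi)/2, about 0.8862)<br />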
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# <math> \displaystyle I = \left.\frac{x^4}{4}\right|_0^1 = 1/4 </math> (the exact value, for comparison)<br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);<br />
mean(x.^3)<br />
<br />
'''Example 3'''<br />
To estimate an infinite integral<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute in <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math> and the limits <math>x=5,\ x\to\infty</math> become <math>y=1,\ y\to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
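A quick numerical check of this example can be sketched in Python, with the integrand written so that it is positive on (0,1), as it must be once the substitution reverses the limits. The function names are hypothetical; the exact value <math>\int_5^\infty 3e^{-x}\,dx = 3e^{-5}</math> is used only for comparison.<br />

```python
import math
import random


def w(y):
    """Transformed integrand 3*exp(-(1/y + 4)) / y^2 on (0, 1]."""
    return 3.0 * math.exp(-(1.0 / y + 4.0)) / (y * y)


def estimate_tail_integral(n, rng):
    """Monte Carlo estimate of the integral of 3*exp(-x) over (5, infinity)."""
    # 1 - rng.random() lies in (0, 1], which avoids dividing by zero
    return sum(w(1.0 - rng.random()) for _ in range(n)) / n


rng = random.Random(8)  # fixed seed for reproducibility
estimate = estimate_tail_integral(20000, rng)
exact = 3.0 * math.exp(-5.0)
print(estimate, exact)
```

Because the transformed integrand is bounded on (0,1), the estimator has small variance even though the original range of integration is infinite.<br />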
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<!-- UNDER CONSTRUCTION! --><br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math> where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be a flat (uniform) distribution and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distribution functions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of p given x successes in n trials is proportional to the Binomial likelihood viewed as a function of p, <math>\displaystyle p^x(1-p)^{n-x}</math>, which is the kernel of a Beta distribution. Therefore<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{i=1}^{N}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
In one run of this example the mean was 0.1938; the value varies slightly from run to run.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\ \text{CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
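The same simulation can be sketched in Python with the standard library, where <code>random.betavariate</code> plays the role of Matlab's betarnd. The function name <code>posterior_delta_samples</code>, the seed, and the number of draws are assumptions of this sketch.<br />

```python
import random


def posterior_delta_samples(x, n, y, m, draws, rng):
    """Draw delta = p1 - p2 with p1 ~ Beta(x+1, n-x+1) and p2 ~ Beta(y+1, m-y+1)."""
    return [rng.betavariate(x + 1, n - x + 1) - rng.betavariate(y + 1, m - y + 1)
            for _ in range(draws)]


rng = random.Random(5)  # fixed seed for reproducibility
delta = sorted(posterior_delta_samples(80, 100, 60, 100, 20000, rng))
mean_delta = sum(delta) / len(delta)
lower = delta[int(0.025 * len(delta))]   # empirical 2.5% quantile
upper = delta[int(0.975 * len(delta))]   # empirical 97.5% quantile
print(mean_delta, (lower, upper))        # mean should be near 20/102
```

Sorting the draws and reading off the empirical quantiles is a simple stand-in for Matlab's quantile function.<br />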
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We proceed as follows:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1,...,10</math><br />
# Keep sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard and try again.<br />
# Compute <math>\displaystyle \delta = x_j</math>, where j is the first index whose sampled <math>\displaystyle p_j</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and it is observed that the number that die is <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate the properties of a particular distribution. As is the case with the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation of the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the function to be integrated. Here we cast the integral <math>\int_{0}^1 g(x)dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), so we draw <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> at which the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult), is large relative to <math>\displaystyle g(x)</math>.<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to samples drawn from <math>\displaystyle g</math>, taking each value <math>\displaystyle h(x_i)</math> and weighting more heavily the ones for which <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math> is high.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted average of <math> h(x_i) </math> instead of a plain average.<br />
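The weighted-average view can be sketched as a small generic routine in Python. Everything named here is an assumption made for illustration: the function <code>importance_sampling</code>, the target density <math>f(x)=e^{-x}</math>, the function <math>h(x)=x^2</math> (so the exact answer is <math>E_f[X^2]=2</math>), and the proposal g, an exponential density with rate 1/2, which has a thicker tail than f.<br />

```python
import math
import random


def importance_sampling(h, f_pdf, g_pdf, g_sampler, n):
    """Estimate the integral of h(x) f(x) dx by averaging h(x_i) f(x_i) / g(x_i) over draws x_i ~ g."""
    total = 0.0
    for _ in range(n):
        x = g_sampler()
        total += h(x) * f_pdf(x) / g_pdf(x)   # weight B(x) = f(x)/g(x) times h(x)
    return total / n


rng = random.Random(4)  # fixed seed for reproducibility
estimate = importance_sampling(
    h=lambda x: x * x,
    f_pdf=lambda x: math.exp(-x),                  # Exp(1) density
    g_pdf=lambda x: 0.5 * math.exp(-0.5 * x),      # Exp(rate 1/2) density
    g_sampler=lambda: rng.expovariate(0.5),        # expovariate takes the rate
    n=20000,
)
print(estimate)  # should be close to 2
```

Passing the densities and the sampler as arguments keeps the routine independent of any particular choice of f, g or h.<br />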
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x)) \rightarrow</math>the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \frac{f(x)}{g(x)} </math> is a weight <math>\displaystyle\beta(x)</math>. Where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) = \hat I </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> faster than <math>\displaystyle h^2(x)f^2(x) </math> does, the integrand blows up and <math>\displaystyle E[(w(x))^2] </math> can be infinite. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>: then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can be arbitrarily large in the tails. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
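<br />
As a worked illustration of this remark (an example added here, not taken from the lecture), suppose <math>\displaystyle f</math> is the N(0,1) density, <math>\displaystyle h(x) = 1</math>, and <math>\displaystyle g</math> is the <math>\displaystyle N(0,\sigma^2)</math> density. Then<br />
:<math>\displaystyle E_g[(w(x))^2] = \int \frac{f^2(x)}{g(x)}\,dx = \frac{\sigma}{\sqrt{2\pi}} \int e^{-x^2\left(1-\frac{1}{2\sigma^2}\right)}\,dx </math><br />
The integral is finite only when <math>\displaystyle 1-\frac{1}{2\sigma^2} > 0</math>, i.e. when <math>\displaystyle \sigma^2 > \frac{1}{2}</math>, in which case it equals <math>\displaystyle \frac{\sigma^2}{\sqrt{2\sigma^2-1}}</math>. For <math>\displaystyle \sigma^2 \leq \frac{1}{2}</math> the tail of g is too thin relative to f and the second moment, and hence the variance, is infinite. For <math>\displaystyle \sigma = 1</math> the formula gives 1, as expected, since then <math>\displaystyle w(x) = 1</math> for every x.<br />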
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since <math>\displaystyle g</math> is a density and a density can't be negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that would result in a variance that is minimized.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
::::::::::======Aside: Jensen's Inequality======<br />
::<br />
If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>; in this aside <math>\displaystyle g </math> denotes a generic convex function, not the proposal density) then for <math>\displaystyle 0 \leq \alpha \leq 1</math>, <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies on or above the curve, which can then be generalized to any number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math> and each <math>\displaystyle \alpha_i \geq 0 </math><br />
::::::::::=======================================================<br />
<br />
=====Proof (cont)=====<br />
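Using Jensen's Inequality with the probabilities <math>\displaystyle p_1, ..., p_n</math> of a discrete random variable as the weights, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... + p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore, taking the convex function <math>\displaystyle g(t) = t^2</math> and applying it to <math>\displaystyle \left| w(x) \right|</math>,<br />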
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
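Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E[(w(x))^2] = \int \frac{h^2(x)f^2(x)}{g(x)} dx</math>, we get <math>\displaystyle \int \left| h(x) \right| f(x) dx \cdot \int \left| h(s) \right| f(s) ds = \left(\int \left| h(x) \right| f(x) dx \right)^2</math>, so the lower bound is attained and <math>\displaystyle g^* </math> minimizes the variance.<br /><br />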
<br />
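However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, because its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the kind of integral we are trying to estimate.<br />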
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_i \sim~ g</math><br />
Since f is a density, <math>\displaystyle \int f(x) dx = 1</math>, so we can also write<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)}{\frac{1}{N}\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant, because any such constant cancels between the numerator and denominator of <math>\displaystyle b'</math>. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
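The normalized-weight form can be sketched in Python as follows. The setup is hypothetical and chosen so that the answer is known: the unnormalized target is <math>e^{-x^2/2}</math> (a standard normal without its constant), <math>h(x) = x^2</math> so that the exact value is 1, and the proposal g is the N(0, 4) density. The function name <code>self_normalized_is</code> is likewise an assumption.<br />

```python
import math
import random


def self_normalized_is(h, f_unnorm, g_pdf, g_sampler, n):
    """Estimate E_f[h] as sum(b_i * h(x_i)) / sum(b_i), with b_i = f_unnorm(x_i) / g(x_i), x_i ~ g."""
    num = 0.0
    den = 0.0
    for _ in range(n):
        x = g_sampler()
        b = f_unnorm(x) / g_pdf(x)   # weight, known only up to a constant
        num += b * h(x)
        den += b
    return num / den


def g_pdf(x):
    """Density of N(0, 2^2)."""
    return math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))


rng = random.Random(9)  # fixed seed for reproducibility
estimate = self_normalized_is(
    h=lambda x: x * x,
    f_unnorm=lambda x: math.exp(-x * x / 2.0),   # N(0,1) kernel, constant omitted
    g_pdf=g_pdf,
    g_sampler=lambda: rng.gauss(0.0, 2.0),
    n=20000,
)
print(estimate)  # should be close to 1
```

Multiplying <code>f_unnorm</code> by any positive constant leaves the estimate unchanged apart from floating-point rounding, which is the property that makes this form usable for unnormalized posteriors.<br />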
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle \Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_3 h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>x_i \sim~ N(0,1) </math><br />
The idea here is to sample from the normal distribution and to compute the proportion of observations that are greater than or equal to 3.<br />
<br />
The variability will be high in this case if using Monte Carlo, since this is a low probability event (a tail event) and different runs may give significantly different values. For example, with 1000 samples the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we generate 100 samples in each of 100 runs):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this once, we got <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 \times 10^{-5} </math>; the values vary from run to run.<br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar in shape to <math>h(x)f(x)</math>, i.e. it puts most of its mass where <math>x \geq 3</math>.<br />
<br />
:Let <math>g</math> be the <math>N(4,1) </math> density, with <math>x_i \sim~ N(4,1)</math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = \frac{e^{-x^2/2}}{e^{-(x-4)^2/2}} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 \times 10^{-8} </math>, which is very close to 0 and is much less than the variability observed when using Monte Carlo.<br />
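The comparison of the two approaches can also be sketched in Python with the standard library. The function names, the seed, and the choice of 200 runs of 100 samples each are assumptions made for this illustration; the exact tail probability is computed from the complementary error function for reference.<br />

```python
import math
import random
import statistics


def tail_prob_mc(n, rng):
    """Plain Monte Carlo: fraction of N(0,1) draws that are >= 3."""
    return sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) >= 3.0) / n


def tail_prob_is(n, rng):
    """Importance sampling with proposal N(4,1) and weight b(x) = exp(8 - 4x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(4.0, 1.0)
        if x >= 3.0:                          # h(x) = 1 only when x >= 3
            total += math.exp(8.0 - 4.0 * x)
    return total / n


exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))   # Pr(Z > 3) for Z ~ N(0,1)
rng = random.Random(6)  # fixed seed for reproducibility
mc_runs = [tail_prob_mc(100, rng) for _ in range(200)]
is_runs = [tail_prob_is(100, rng) for _ in range(200)]
print(exact)
print(statistics.mean(mc_runs), statistics.variance(mc_runs))
print(statistics.mean(is_runs), statistics.variance(is_runs))
```

With only 100 samples per run, most plain Monte Carlo runs see no draws beyond 3 at all, which is why their run-to-run variance is so much larger.<br />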
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space Set:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
For a Markov chain, the joint density therefore factorizes as<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from state <math>i</math> to state <math>j</math> in one step.<br />
:<math>p_{ij} = \Pr(X_{t+1}=j\mid X_t= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For <math>N</math> states, the transition matrix <math>P</math> is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. With the states ordered from left to right, the transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0&0&0\\<br />
q&0&p&0&\dots&0&0&0\\<br />
0&q&0&p&\dots&0&0&0\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots&\vdots\\<br />
0&0&0&0&\dots&q&0&p\\<br />
0&0&0&0&\dots&0&0&1<br />
\end{matrix}\right)</math><br />
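As a small sketch of this example, the matrix of a random walk with absorbing end states can be built in Python and checked to be a valid transition matrix. The function name <code>random_walk_matrix</code>, the number of states and the value of <math>p</math> are illustrative assumptions, and the states are labelled from 0.<br />

```python
def random_walk_matrix(n_states, p):
    """Transition matrix of a random walk on states 0..n_states-1 with absorbing ends."""
    q = 1.0 - p
    P = [[0.0] * n_states for _ in range(n_states)]
    P[0][0] = 1.0                          # the first state is absorbing
    P[n_states - 1][n_states - 1] = 1.0    # the last state is absorbing
    for i in range(1, n_states - 1):
        P[i][i - 1] = q                    # step to the left with probability q
        P[i][i + 1] = p                    # step to the right with probability p
    return P

P = random_walk_matrix(5, 0.3)             # hypothetical size and p
row_sums = [sum(row) for row in P]
for row in P:
    print(row)
```

Every row sums to 1, because each row is the pmf of the next state given the current one.<br />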
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a set of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that a sample average can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> (summing over all <math>j</math>) add up to 1: the row must itself be a pmf, since a transition from <math>i</math> always leads to a state within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
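The theorem can be checked numerically with a short Python sketch. The two-state matrix <code>P</code> and the starting distribution <code>mu0</code> below are made-up values for illustration, and the helper names are assumptions. The marginal is propagated one step at a time and compared with <math>\mu_0P^n</math>.<br />

```python
def vec_mat(mu, P):
    """Row vector times matrix: (mu P)_j = sum_i mu_i P_ij."""
    return [sum(mu[i] * P[i][j] for i in range(len(mu))) for j in range(len(P[0]))]

def mat_mul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def mat_pow(P, n):
    """n-step transition matrix P^n, starting from the identity."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

P = [[0.9, 0.1], [0.4, 0.6]]    # hypothetical homogeneous two-state chain
mu0 = [1.0, 0.0]                # start in the first state
n = 5

mu_step = mu0
for _ in range(n):              # mu_n = mu_{n-1} P, applied n times
    mu_step = vec_mat(mu_step, P)

mu_power = vec_mat(mu0, mat_pow(P, n))   # mu_n = mu_0 P^n
print(mu_step, mu_power)
```

Both routes give the same marginal distribution, which is again a pmf.<br />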
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' also NxN with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)'''<math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that <math>j</math> is accessible from <math>i</math>: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all of its states communicate with each other (i.e. it has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 communicate (<math>\displaystyle P_{12}</math> and <math>\displaystyle P_{21}</math> are nonzero) and that neither of them can reach state 3 or state 4<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> The state 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a closed class and there is only one element<br />
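The decomposition above can be reproduced mechanically. The following Python sketch (the function names are assumptions) first computes which states reach which others with a transitive closure, then groups states that reach each other. A state is taken to reach itself in zero steps, in line with <math>i\longleftrightarrow i</math>.<br />

```python
def reachable(P):
    """reach[i][j] is True when j can be reached from i in zero or more steps."""
    n = len(P)
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):                       # Warshall's transitive closure
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    return reach

def communicating_classes(P):
    """Group states that reach each other; states are reported with 1-based labels."""
    reach = reachable(P)
    n = len(P)
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
        seen.update(cls)
        classes.append([j + 1 for j in cls])
    return classes

P = [[1/3, 2/3, 0, 0],
     [2/3, 1/3, 0, 0],
     [1/4, 1/4, 1/4, 1/4],
     [0, 0, 0, 1]]
classes = communicating_classes(P)
print(classes)
```

For the matrix of this example the sketch recovers the three classes listed above.<br />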
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straight forward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states are recurrent. On the other hand, if one of them is transient, then all the others are transient too.<br />
*Consider now the case in which we start in state <math>\displaystyle 0</math>. The chain can be back at <math>\displaystyle 0</math> only after an even number of steps, say 2n, of which exactly n are to the right and n are to the left:<br />
<math>\displaystyle Pr(x_{2n}=0 \mid x_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know whether this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since a return is impossible after an odd number of steps, the sum behaves like:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{\forall n}P_{00}(2n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Note that <math>\displaystyle 4pq \leq 1</math>, with equality only when <math>p = \frac{1}{2}</math>. If <math>\displaystyle 4pq < 1</math> the series converges and the state is transient; otherwise the terms decay only like <math>1/\sqrt{n\pi}</math>, the series diverges, and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n)\approx\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
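A quick numerical check of this conclusion, written as a Python sketch (the function name and the values of <math>p</math> are illustrative assumptions): the partial sums of <math>\binom{2n}{n}p^nq^n</math> settle at <math>1/\sqrt{1-4pq}</math> when <math>p \neq \frac{1}{2}</math>, and keep growing when <math>p = \frac{1}{2}</math>.<br />

```python
import math

def return_probability_sum(p, n_terms):
    """Partial sum of P_00(2n) = C(2n, n) (pq)^n for n = 0 .. n_terms - 1."""
    q = 1.0 - p
    term = 1.0                    # n = 0 term
    total = 0.0
    for n in range(n_terms):
        total += term
        # ratio of consecutive terms: C(2n+2, n+1) / C(2n, n) = 2(2n+1)/(n+1)
        term *= 2.0 * (2 * n + 1) / (n + 1) * p * q
    return total

biased = return_probability_sum(0.3, 2000)
closed_form = 1.0 / math.sqrt(1.0 - 4.0 * 0.3 * 0.7)
fair_1k = return_probability_sum(0.5, 1000)
fair_100k = return_probability_sum(0.5, 100000)
print(biased, closed_form, fair_1k, fair_100k)
```

For <math>p = 0.3</math> the partial sum matches the closed form, while for <math>p = 0.5</math> it grows roughly like <math>\sqrt{n}</math> and never levels off.<br />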
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time needed to go from state i to state j. It is also possible that state j is not reachable from state i, in which case the time is infinite.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n \geq 1: x_{n}=j \mid x_{0}=i\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive<br />
<br />
====Periodic and aperiodic Markov chain====<br />
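A state of a Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle d > 1</math> if, starting from that state, the chain can return to it only at multiples of <math>\displaystyle d</math> steps (<math>\displaystyle d</math> is the greatest common divisor of all <math>n</math> with <math>P_{ii}(n) > 0</math>). A Markov chain is called periodic if its states are periodic.<br />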
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It is evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: starting from state 1, the chain can be back in state 1 only after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
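The period of a state can be estimated by brute force as the greatest common divisor of the step counts at which a return has positive probability. The Python sketch below (the function names and the cut-off <code>max_steps</code> are assumptions) applies this to the two example chains.<br />

```python
from math import gcd

def mat_mul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def period(P, state, max_steps=50):
    """gcd of all n <= max_steps with P^n[state][state] > 0 (0 if no return is seen)."""
    d = 0
    power = P                              # power holds P^n, starting with n = 1
    for n in range(1, max_steps + 1):
        if power[state][state] > 0:
            d = gcd(d, n)
        power = mat_mul(power, P)
    return d

cycle3 = [[0, 1, 0],
          [0, 0, 1],
          [1, 0, 0]]
chain4 = [[0, 0.5, 0, 0.5],
          [0.5, 0, 0.5, 0],
          [0, 0.5, 0, 0.5],
          [0.5, 0, 0.5, 0]]
period_cycle3 = period(cycle3, 0)
period_chain4 = period(chain4, 0)
print(period_cycle3, period_chain4)
```

Only finitely many powers are inspected, so this is a heuristic check, not a proof, but it is enough for small chains like these.<br />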
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.<br />
'''<br><br />
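We need to show <math>\pi = \pi P</math>.<br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required, where the second equality is detailed balance and the last one holds because each row of <math>P</math> sums to 1.<br />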
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if <math>\pi = [0.5 \ 0 \ 0.5]</math>, then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
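The result can be double-checked numerically: raising <math>P</math> to a high power should give a matrix whose rows all approach <math>\pi</math>. The Python sketch below does this with plain lists; the number of multiplications (200) is an arbitrary choice.<br />

```python
def mat_mul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

P = [[1/2, 1/2, 0],
     [1/2, 1/4, 1/4],
     [0, 1/3, 2/3]]

Pn = P
for _ in range(200):             # Pn becomes a high power of P
    Pn = mat_mul(Pn, P)

pi_exact = [4/11, 4/11, 3/11]    # solution of pi = pi P found above
for row in Pn:
    print(row)
```

Each printed row agrees with <math>(4/11, 4/11, 3/11)</math> up to rounding, which illustrates that the limiting distribution equals the stationary one for this chain.<br />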
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>, this is the starting point of the chain, choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. The starting point <math>X_0</math> of the algorithm is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of the Metropolis-Hastings algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The density of sampling y from a normal with mean x is equal to the density of sampling x from a normal with mean y.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
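For readers without Matlab, the same sampler can be sketched in Python. This is an illustrative translation, not the original code: the function name <code>metropolis_cauchy</code>, the seed and the chain length are assumptions. Since the standard Cauchy distribution puts probability 0.5 on <math>[-1, 1]</math>, the fraction of samples in that interval gives a rough sanity check.<br />

```python
import random

def metropolis_cauchy(n_samples, b, rng):
    """Random-walk Metropolis chain targeting the standard Cauchy density."""
    x = rng.gauss(0.0, 1.0)                  # random starting point X_0
    chain = [x]
    for _ in range(n_samples - 1):
        y = x + b * rng.gauss(0.0, 1.0)      # proposal Y ~ N(x, b^2)
        r = min((1.0 + x * x) / (1.0 + y * y), 1.0)   # f(y)/f(x) for the Cauchy
        if rng.random() < r:
            x = y                            # accept Y
        chain.append(x)                      # on rejection the old state is repeated
    return chain

rng = random.Random(1)                       # fixed seed (an arbitrary choice)
chain = metropolis_cauchy(50000, 2.0, rng)
inside = sum(1 for x in chain if -1.0 <= x <= 1.0) / len(chain)
median = sorted(chain)[len(chain) // 2]
print(inside, median)
```

With <math>b=2</math> the fraction should land near 0.5 and the sample median near 0; the sample mean is not a useful check here because the Cauchy distribution has no mean.<br />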
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the proposed <math>\displaystyle Y</math> usually lands far out in the tails of the target, so the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in a reasonable number of steps.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain moves in tiny steps and will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::Well chosen b will help avoid the issues mentioned above and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
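The effect of b can also be seen without plots by measuring how often proposals are accepted. The Python sketch below (the function name, seed and chain length are assumptions) runs the Cauchy example for the three values of b used above.<br />

```python
import random

def acceptance_rate(b, n_steps, rng):
    """Fraction of accepted proposals for random-walk Metropolis on the Cauchy target."""
    x = 0.0
    accepted = 0
    for _ in range(n_steps):
        y = x + b * rng.gauss(0.0, 1.0)               # proposal Y ~ N(x, b^2)
        r = min((1.0 + x * x) / (1.0 + y * y), 1.0)   # acceptance probability
        if rng.random() < r:
            x = y
            accepted += 1
    return accepted / n_steps

rates = {b: acceptance_rate(b, 20000, random.Random(2)) for b in (0.001, 2.0, 1000.0)}
for b, rate in rates.items():
    print(b, rate)
```

A tiny b accepts almost every proposal but barely moves, a huge b rejects almost everything, and b=2 sits in between. A high acceptance rate alone therefore does not mean that the chain mixes well.<br />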
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We talked about <math>\emph{discrete}</math> MC so far. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math>then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In <math>\emph{continuous}</math>case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
That is to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> (by detailed balance)<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case in which it equals 1, where both acceptance probabilities are 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and multiplying both sides by <math>\displaystyle f(x)</math> we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
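The detailed balance argument can be verified numerically on a small discrete example. In the Python sketch below, the target pmf <code>f</code> and the asymmetric proposal matrix <code>Q</code> are made-up values, and the function name is an assumption. The Metropolis-Hastings transition matrix is built as <math>P_{ij} = q_{ij}\,r(i,j)</math> for <math>i \neq j</math>, with the remaining mass left on the diagonal.<br />

```python
def mh_transition_matrix(f, Q):
    """Metropolis-Hastings transition matrix for target pmf f and proposal matrix Q."""
    n = len(f)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and Q[i][j] > 0:
                r = min(f[j] * Q[j][i] / (f[i] * Q[i][j]), 1.0)  # acceptance probability
                P[i][j] = Q[i][j] * r        # propose j, then accept it
        P[i][i] = 1.0 - sum(P[i])            # rejections and self-proposals stay at i
    return P

f = [0.1, 0.2, 0.3, 0.4]                     # hypothetical target pmf
Q = [[0.1, 0.6, 0.2, 0.1],                   # hypothetical proposal, rows sum to 1
     [0.3, 0.3, 0.3, 0.1],
     [0.25, 0.25, 0.25, 0.25],
     [0.7, 0.1, 0.1, 0.1]]
P = mh_transition_matrix(f, Q)

max_gap = max(abs(f[i] * P[i][j] - f[j] * P[j][i]) for i in range(4) for j in range(4))
stationary = [sum(f[i] * P[i][j] for i in range(4)) for j in range(4)]
print(max_gap, stationary)
```

The gap in the detailed balance equations is zero up to rounding, and <math>fP</math> reproduces <math>f</math>, as the proof predicts.<br />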
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (which is introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is more efficient, although less widely applicable.<br />
<br />
In the last class, we showed that Metropolis-Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>, i.e. <math>\epsilon = Y-X_i \sim~ g </math>. Here <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math>.<br />
:2. This means <math>q(y\mid{x}) = g(y-x)</math>. (We take <math>\displaystyle g </math> to be a function of the distance between the current state and the proposed state, i.e. of the form <math>\displaystyle g(|y-x|) </math>. Hence <math>\displaystyle q </math> is symmetric in this version: <math>q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>.)<br />
:3. Accept <math>\displaystyle Y </math> with probability <math>r=\min{\{\frac{f(y)}{f(x)},1}\}</math>; since <math>\displaystyle q </math> is symmetric, the ratio <math>\frac{q(x\mid{y})}{q(y\mid{x})}</math> cancels.<br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In MATLAB, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e. <math>\displaystyle Y = X_i + randn*b </math>)<br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
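<br />
The random-walk procedure above can be sketched in Python (the lecture's own snippets are MATLAB). This is only an illustration under stated assumptions: the target is the standard Cauchy density from the earlier example, and the names <code>cauchy_density</code> and <code>random_walk_mh</code>, the seed, and the three values of b are illustrative choices that do not come from the lecture. Printing the acceptance rate makes it easy to experiment with the rule of thumb for b.<br />

```python
import math
import random


def cauchy_density(x):
    """Standard Cauchy density f(x) = 1 / (pi * (1 + x^2))."""
    return 1.0 / (math.pi * (1.0 + x * x))


def random_walk_mh(f, x0, b, n, rng):
    """Random-walk Metropolis-Hastings with a N(x, b^2) proposal.

    Returns the chain and the fraction of proposals that were accepted.
    """
    chain = [x0]
    accepted = 0
    for _ in range(n - 1):
        x = chain[-1]
        y = x + b * rng.gauss(0.0, 1.0)   # Y = X_i + epsilon, epsilon ~ N(0, b^2)
        r = min(f(y) / f(x), 1.0)         # symmetric q, so the q-ratio cancels
        if rng.random() < r:
            chain.append(y)               # accept the proposal
            accepted += 1
        else:
            chain.append(x)               # reject: stay at the current state
    return chain, accepted / (n - 1)


rng = random.Random(0)
for b in (0.1, 2.0, 50.0):
    chain, rate = random_walk_mh(cauchy_density, 0.0, b, 5000, rng)
    print(f"b = {b:5.1f}  acceptance rate = {rate:.2f}")
```

A very small b accepts almost every proposal but moves slowly, while a very large b rejects most proposals; both extremes mix poorly.<br />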
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw <math>\displaystyle Y </math> from a fixed distribution <math>\displaystyle q </math><br />
<br />
and define <math>r = \min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. (It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known, but the distribution is not exactly known.)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: No. Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>, so the sequence is not really independent:<br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Whether the chain moves to <math>\displaystyle y </math> or stays at <math>\displaystyle x_i </math> depends on the current state <math>\displaystyle X_i </math> through <math>\displaystyle r </math>.<br />
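<br />
The dependence between successive states can be seen in a small Python sketch of the independence sampler. The target (a standard normal known only up to a constant), the fixed <math>\displaystyle N(0, 2^2)</math> proposal, and the function names are assumptions made for illustration; they are not from the lecture.<br />

```python
import math
import random


def f(x):
    """Target density known only up to a constant: f(x) proportional to exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)


def q(x, s=2.0):
    """Fixed proposal density N(0, s^2), also up to a constant."""
    return math.exp(-0.5 * (x / s) ** 2)


def independence_mh(n, rng, s=2.0):
    chain = [0.0]
    for _ in range(n - 1):
        x = chain[-1]
        y = rng.gauss(0.0, s)                           # the proposal ignores x
        r = min(f(y) * q(x, s) / (f(x) * q(y, s)), 1.0)  # but r depends on x
        chain.append(y if rng.random() < r else x)
    return chain


chain = independence_mh(5000, random.Random(0))
repeats = sum(1 for a, b in zip(chain, chain[1:]) if a == b)
print("fraction of steps where the chain did not move:", repeats / (len(chain) - 1))
```

Every rejected proposal repeats the previous value, so consecutive states are correlated even though the proposals themselves are independent draws.<br />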
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. This is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant <math>\displaystyle T > 0 </math> (since the exponential function is monotone increasing, so <math>\displaystyle e^{\frac{-h(x)}{T}}</math> is largest exactly where <math>\displaystyle h(x) </math> is smallest).<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, set <math>\displaystyle i = i + 1</math>, and go to Step 3 (returning to Step 2 would discard the current state)<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math>, but <math>\displaystyle r </math> is close to 1 because <math>\displaystyle T </math> is large, so we still accept <math>\displaystyle y </math> with high probability<br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore (almost always) reject <math>\displaystyle y </math><br />
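<br />
The two temperature regimes can be checked numerically. The sketch below evaluates <math>r = \min\{e^{\frac{h(x)-h(y)}{T}}, 1\}</math> for an uphill move that raises h by 1; the function name and the chosen temperatures are illustrative assumptions.<br />

```python
import math


def acceptance_probability(h_x, h_y, T):
    """r = min(exp((h(x) - h(y)) / T), 1) for a symmetric proposal."""
    if h_y <= h_x:
        return 1.0                      # downhill (or flat) moves are always accepted
    return math.exp((h_x - h_y) / T)    # uphill moves are accepted with probability < 1


# An uphill move that raises h by 1, at decreasing temperatures.
for T in (100.0, 1.0, 0.1, 0.01):
    print(T, acceptance_probability(h_x=0.0, h_y=1.0, T=T))
```

At high temperature the chain wanders almost freely (which lets it leave local minima); at low temperature it behaves almost like a pure descent method.<br />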
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
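We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math> (more precisely, at <math>\displaystyle x \approx -2.75 </math>, where <math>\displaystyle f'(x) = -6x^2 - 2x + 40 = 0 </math>)<br />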
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100</math> and <math>\displaystyle T = 0.1</math>, and observe that the distribution spreads out for large T and contracts for small T, i.e. T plays the role of a variance.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, when T is very small, the distribution that we are sampling from becomes sharply peaked, and the points that we sample are close to the local maximum of the exponential function (the mode of the distribution), which corresponds to the local minimum of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
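% obj.m (separate file): the objective function to be minimized<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
% Script: Metropolis steps at a fixed temperature T<br />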
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); <br />
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. the chain converges to the minimizer <math>\displaystyle x = 2 </math> of the function.<br />
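<br />
For comparison, the whole temperature schedule can be run in one loop. The following Python sketch mirrors the MATLAB fragment above; running 100 steps at each temperature, the seed, and the function names are assumptions made for illustration.<br />

```python
import math
import random


def h(x):
    """Objective from Example 2."""
    return (x - 2.0) ** 2


def simulated_annealing(h, x0, temperatures, steps_per_T, b, rng):
    """Run `steps_per_T` Metropolis steps at each temperature, in order."""
    x = x0
    path = [x]
    for T in temperatures:
        for _ in range(steps_per_T):
            y = x + b * rng.gauss(0.0, 1.0)            # symmetric N(x, b^2) proposal
            # r = min(1, exp((h(x) - h(y)) / T)); taking min() inside exp avoids overflow
            r = math.exp(min(0.0, (h(x) - h(y)) / T))
            if rng.random() < r:
                x = y                                  # accept the proposal
            path.append(x)
    return path


schedule = [10, 9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001]
path = simulated_annealing(h, 0.0, schedule, 100, 2.0, random.Random(0))
print("final state:", path[-1])
```

Because the run is random, the final state differs from seed to seed, but it should end close to 2.<br />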
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
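<br />
Simulated annealing applies to the travelling salesman problem by taking h to be the length of the circuit. The Python sketch below is one possible setup, not the method from the lecture or from the cited paper: the proposal reverses a randomly chosen segment of the tour (a symmetric proposal), and the city coordinates, temperature schedule, and function names are all made up for illustration.<br />

```python
import math
import random


def tour_length(tour, cities):
    """Total length of the closed circuit visiting `cities` in `tour` order."""
    return sum(
        math.dist(cities[tour[k]], cities[tour[(k + 1) % len(tour)]])
        for k in range(len(tour))
    )


def anneal_tsp(cities, temperatures, steps_per_T, rng):
    n = len(cities)
    tour = list(range(n))
    rng.shuffle(tour)                                  # random starting circuit
    cost = tour_length(tour, cities)
    for T in temperatures:
        for _ in range(steps_per_T):
            i, j = sorted(rng.sample(range(n), 2))
            # Symmetric proposal: reverse the segment between positions i and j.
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            new_cost = tour_length(candidate, cities)
            if rng.random() < math.exp(min(0.0, (cost - new_cost) / T)):
                tour, cost = candidate, new_cost
    return tour, cost


rng = random.Random(0)
cities = [(rng.random(), rng.random()) for _ in range(12)]   # made-up coordinates
tour, cost = anneal_tsp(cities, [1.0, 0.3, 0.1, 0.03, 0.01, 0.003], 500, rng)
print("circuit:", tour, "length:", round(cost, 3))
```

The result is a short circuit, but there is no guarantee that it is the shortest one.<br />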
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, as Metropolis–Hastings does:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the proof given for Metropolis–Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method can mix very slowly when the random variables are strongly correlated.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle <br />
<br />
x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note:</b> This method usually takes a long time to converge; this initial period is called the "burn-in" (or "burning") time. For this reason, the distribution is sampled better by keeping only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are also normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
mu_vec = [1 ; 2];          % mean vector<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2);   % conditional standard deviation<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
% draw x1 given the previous x2<br />
mu = mu_vec(1) + rho*(X(i-1,2) - mu_vec(2));<br />
X(i,1) = mu + sigma*randn;<br />
% draw x2 given the x1 just drawn (not the previous one)<br />
mu = mu_vec(2) + rho*(X(i,1) - mu_vec(1));<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -><br />
%this demonstrates that convergence occurs after a while<br />
%(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
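<br />
The procedure above can be sketched in Python as follows. The target (a bivariate normal with mean (1, 2), unit variances and correlation 0.5, treated as known only up to a constant), the random-walk normal proposals, the step size, and all names are assumptions made for illustration. Because both proposals are symmetric, the q-ratios in r equal 1 and are omitted.<br />

```python
import math
import random

RHO = 0.5


def log_f(x, y):
    """Log of the unnormalized target: bivariate normal, mean (1, 2), correlation RHO."""
    dx, dy = x - 1.0, y - 2.0
    return -(dx * dx - 2.0 * RHO * dx * dy + dy * dy) / (2.0 * (1.0 - RHO ** 2))


def mh_within_gibbs(n, step, rng):
    x, y = 0.0, 0.0
    samples = []
    for _ in range(n):
        # MH step for X with Y fixed (symmetric random-walk proposal).
        z = x + step * rng.gauss(0.0, 1.0)
        if rng.random() < math.exp(min(0.0, log_f(z, y) - log_f(x, y))):
            x = z
        # MH step for Y with X fixed at its updated value.
        z = y + step * rng.gauss(0.0, 1.0)
        if rng.random() < math.exp(min(0.0, log_f(x, z) - log_f(x, y))):
            y = z
        samples.append((x, y))
    return samples


samples = mh_within_gibbs(20000, 1.5, random.Random(0))
kept = samples[2000:]                                   # discard the burn-in period
print("mean of X:", sum(s[0] for s in kept) / len(kept))
print("mean of Y:", sum(s[1] for s in kept) / len(kept))
```

Working with log f rather than f avoids numerical underflow when the density is very small.<br />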
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative quantity, we are free to fix its overall scale. Summing the rank formula over i (with <math>\displaystyle d < 1</math> and every page having at least one outgoing link) shows that the ranks must add up to N, so we assume that the average rank is 1, i.e.<br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
(in matrix form, <math>\displaystyle e^T \times P = N</math>, so that <math>\displaystyle \frac{1}{N} \times e^T \times P = 1</math>). Then we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = \frac{1-d}{N}\times e \times e^T \times P + d \times L \times D^{-1} \times P \text{ by replacing 1 with } \frac{1}{N} \times e^T \times P </math> <br />
<br />
<math>\displaystyle P = \left[\frac{1-d}{N} \times e \times e^T + d \times L \times D^{-1}\right] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known)} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n <math>\displaystyle P_n \approx P</math>.<br />
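<br />
A minimal Python sketch of this iteration is given below. It iterates the rank formula <math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math> directly rather than forming A. The three-page link structure is hypothetical, the value d = 0.85 is a commonly quoted choice rather than one taken from the lecture, and skipping pages with no outgoing links is an assumption made to avoid dividing by zero.<br />

```python
def pagerank(L, d=0.85, iterations=100):
    """Iterate P <- (1 - d) e + d L D^{-1} P, with L[i][j] = 1 if page j links to page i."""
    n = len(L)
    c = [sum(L[i][j] for i in range(n)) for j in range(n)]    # outgoing links of page j
    P = [1.0] * n                                             # initial guess P_0
    for _ in range(iterations):
        P = [
            (1.0 - d) + d * sum(L[i][j] * P[j] / c[j] for j in range(n) if c[j] > 0)
            for i in range(n)
        ]
    return P


# Hypothetical 3-page web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
L = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
P = pagerank(L)
print([round(p, 3) for p in P], "sum =", round(sum(P), 3))
```

Page 2 receives links from both other pages and ends up with the highest rank, while page 1, which is linked only from page 0 (which splits its vote between two pages), ends up with the lowest.<br />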
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
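<br />
These four definitions translate directly into code. The Python sketch below is only an illustration; the function names and the example vectors are arbitrary choices.<br />

```python
import math


def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Length ||v|| = sqrt(v . v)."""
    return math.sqrt(dot(v, v))


def dist(u, v):
    """Distance ||u - v||."""
    return norm([a - b for a, b in zip(u, v)])


def angle(u, v):
    """Angle in radians, from u . v = ||u|| ||v|| cos(theta)."""
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))   # clamp against rounding error


u, v = [3.0, 4.0], [4.0, -3.0]
print(dot(u, v), norm(u), dist(u, v), math.degrees(angle(u, v)))
```

Here the inner product is 0, so the two vectors are at a right angle.<br />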
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that the terms "orthogonal matrix" and "orthonormal matrix" refer to the same kind of matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace can equivalently be written as<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues <math>\displaystyle \lambda_1, \dots, \lambda_n </math> of the matrix (counted with multiplicity).<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Equivalently, a square matrix is non-singular if and only if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When the columns of A are linearly independent (so that <math>A^TA</math> is invertible), <math>A^{\dagger} = (A^TA)^{-1}A^T</math> and <math>A^{\dagger}A = I</math>.<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given a matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> (the corresponding eigenvalue) such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
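<br />
For a 2 by 2 matrix the characteristic equation is a quadratic, which makes a small worked example possible. The Python sketch below assumes a real 2 by 2 matrix with real eigenvalues; the function name and the example matrix are illustrative choices.<br />

```python
import math


def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]] from |A - lambda I| = 0.

    For a 2x2 matrix the characteristic equation is
    lambda^2 - (a + d) lambda + (a d - b c) = 0.
    """
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4.0 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; this sketch handles the real case only")
    root = math.sqrt(disc)
    return (trace + root) / 2.0, (trace - root) / 2.0


lam1, lam2 = eigenvalues_2x2(2.0, 1.0, 1.0, 2.0)
print(lam1, lam2)            # the eigenvalues sum to the trace, 4
```

For this symmetric matrix the characteristic equation is <math>\lambda^2 - 4\lambda + 3 = 0</math>, whose roots are 3 and 1.<br />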
<br />
Properties<br />
*If A is non-singular, all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the associated eigenvectors can be chosen to be orthogonal.<br />
*If A is positive-definite, all eigenvalues are positive.<br />
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances along the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
Given a high-dimensional sample of vectors, applying PCA produces an orthogonal set of vectors (called principal components) such that the first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
If we ignore the last few principal components (directions with the smallest variance) then we can approximate the data by a lower-dimensional subspace, which is easier to analyze and plot.<br />
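<br />
As a small illustration of this definition, the Python sketch below finds the first principal component of a handful of 2-D points by power iteration on the sample covariance matrix. Power iteration is one simple way to find the dominant eigenvector and is not necessarily the method used in the course. The data points and the function name are made up, and the sketch assumes the starting direction is not orthogonal to the answer.<br />

```python
import math


def first_principal_component(points, iterations=200):
    """Direction of greatest variance of 2-D points, via power iteration
    on the sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    w = (1.0, 0.0)                                   # arbitrary starting direction
    for _ in range(iterations):
        w = (sxx * w[0] + sxy * w[1], sxy * w[0] + syy * w[1])
        length = math.hypot(w[0], w[1])
        w = (w[0] / length, w[1] / length)           # keep the direction at unit length
    return w


# Made-up points scattered around the line y = x.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
print(first_principal_component(data))
```

Since the points lie close to the line y = x, the returned unit vector has nearly equal components, i.e. it points along that line.<br />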
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction], PCA is linear).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of PCA===<br />
We want to find the direction of maximum variation. So take a direction <math>w = [w_1, \ldots, w_D]^T</math> and a data point <math>x = [x_1, \ldots, x_D]^T </math>, then compute the length of the projection of the point onto that direction.<br />
<br />
<math><br />
u = \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
</math><br />
<br />
Of course, the direction <math>\textbf{w}</math> is the same as <math>2\textbf{w}</math> or in general <math>c\textbf{w}</math>, and it doesn't matter which one we use. So without loss of generality, let the length of <math>\textbf{w}</math> be 1. Therefore <math>\textbf{w}^T \textbf{w} = 1</math> so the equation simplifies to just<br />
<br />
<math><br />
u = \textbf{w}^T \textbf{x}.<br />
</math><br />
<br />
Let <math>x_1, \ldots, x_D</math> be random variables; then our goal is to maximize the variance of <math>u</math>, which is<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br />
where <math>\Sigma</math> is the covariance matrix. For a finite data set we can replace <math>\Sigma</math> by <math>s</math>, the sample covariance matrix. <br />
<br />
So <math>\displaystyle w^T sw </math> is the sample variance of the projection <math>\displaystyle u = w^T x </math> obtained from the weight vector <math>\displaystyle w </math>.<br />
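<br />
As a quick numerical check of this identity, here is a minimal Matlab sketch on synthetic data. The data, the mixing matrix and the direction <math>w</math> are made up for illustration, and the points are stored one per row (unlike the image example above, where each column is one image).<br />
<br />
 rng(0);                                      % fix the seed so the run is repeatable<br />
 X = randn(500,3) * [2 0 0; 1 1 0; 0 0 0.5];  % hypothetical data: 500 points in 3 dimensions<br />
 S = cov(X);                                  % 3 x 3 sample covariance matrix<br />
 w = [1; 2; 2] / 3;                           % a unit-length direction, since 1 + 4 + 4 = 9<br />
 u = X * w;                                   % projection of every point onto w<br />
 disp([var(u), w' * S * w])                   % the two numbers agree up to rounding<br />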
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br />
Therefore the variance is bounded, so the maximum exists. Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, but subject to a constraint. The problem then becomes:<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br />
<br />
Next lecture we will actually find the maximum.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that finding the direction of maximum variance amounts to solving: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a Lagrange Multiplier and we form the Lagrangian L:<br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br />
If <math>\displaystyle (x^*,y^*)</math> is the maximum of <math>\displaystyle f(x,y)</math> subject to the constraint, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of L (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which the level curve of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point, the tangents of the two curves are parallel (equivalently, the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel).<br />
<br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where <math>\displaystyle \nabla_{x,y} f = \left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)</math><br />
<br><br />
and <math>\displaystyle \nabla_{x,y} g = \left(\frac{\partial g}{\partial x},\frac{\partial g}{\partial y}\right)</math><br />
<br><br />
To incorporate these into one equation, we define L as <math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math>.<br />
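<br />
As a small worked illustration (the functions here are chosen for this note and are not from the lecture), consider maximizing <math>\displaystyle f(x,y) = x + y</math> subject to <math>\displaystyle g(x,y) = x^2 + y^2 - 1 = 0</math>. The Lagrangian is<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x + y - \lambda (x^2 + y^2 - 1)</math><br />
<br />
Setting the partial derivatives to zero gives<br />
<br />
<math>\displaystyle 1 - 2\lambda x = 0, \quad 1 - 2\lambda y = 0, \quad x^2 + y^2 = 1</math><br />
<br />
so <math>\displaystyle x = y = \frac{1}{2\lambda}</math> and <math>\displaystyle \lambda = \pm \frac{1}{\sqrt{2}}</math>. The choice <math>\displaystyle \lambda = \frac{1}{\sqrt{2}}</math> gives the constrained maximum <math>\displaystyle f = \sqrt{2}</math> at <math>\displaystyle x = y = \frac{1}{\sqrt{2}}</math>; the choice <math>\displaystyle \lambda = -\frac{1}{\sqrt{2}}</math> gives the constrained minimum <math>\displaystyle f = -\sqrt{2}</math>.<br />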
<br />
<br><br />
Back to the original problem, from the Lagrangian we obtain <math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T s \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T s \textbf{w} </math> can be thought of as a quadratic function in '''w''', hence the '''2sw''' below)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2s\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br />
Setting <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br />
<math>\displaystyle s\textbf{w} = \lambda\textbf{w} </math><br />
<br><br />
This equation means that <math>\textbf{w}</math> is an eigenvector of s and <math>\lambda</math> is an eigenvalue of s.<br />
<br><br />
If we substitute <math>\displaystyle s\textbf{w} = \lambda\textbf{w}</math> into <math>\displaystyle \textbf{w}^T s\textbf{w}</math> we obtain <math>\displaystyle\textbf{w}^T s\textbf{w} = \textbf{w}^T \lambda \textbf{w} = \lambda \textbf{w}^T \textbf{w} = \lambda </math><br />
<br><br />
In order to maximize the objective function we need to choose the eigenvector with the largest eigenvalue.<br />
<br />
We choose the first PC, '''u1''', to have the maximum variance (i.e. to capture as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible).<br />
<br />
Subsequent principal components take up successively smaller parts of the total variability.<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(s) = \sum_{i = 1}^D Var(x_i)</math><br />
<br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
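<br />
The eigenvalue result and the variance decomposition can be checked numerically. The following is only a sketch on synthetic data (the data and the mixing matrix are made up, and points are stored one per row); it uses an eigendecomposition of the sample covariance matrix, which is one of several ways to compute principal components.<br />
<br />
 rng(1);                                      % fix the seed so the run is repeatable<br />
 X = randn(1000,3) * [3 0 0; 1 2 0; 0 1 1];   % hypothetical data: 1000 points in 3 dimensions<br />
 s = cov(X);                                  % sample covariance matrix<br />
 [W, L] = eig(s);                             % columns of W are unit-length eigenvectors of s<br />
 [lambda, order] = sort(diag(L), 'descend');  % eigenvalues, largest first<br />
 w1 = W(:, order(1));                         % first principal direction<br />
 disp([var(X * w1), lambda(1)])               % variance along w1 equals the largest eigenvalue<br />
 disp([sum(lambda), trace(s), sum(var(X))])   % all three give the total variance<br />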
<br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
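 % Path as given in class; adjust it to wherever noisy.mat is stored<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
 idx = 1;                                   % column of X (image) to display<br />
 imagesc(reshape(X(:,idx),20,28)')          % noisy image<br />
 colormap gray<br />
 [u s v] = svd(X);                          % singular value decomposition of the data<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';  % rank-10 reconstruction<br />
 figure<br />
 imagesc(reshape(xHat(:,idx),20,28)')       % denoised version of the same image<br />
 colormap gray<br />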
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
Main Contribution not complete<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this. We can classify the data (i.e. the data are labeled, and new data are assigned to one of the known classes) or cluster the data (i.e. the data are not labeled, and the groups are found from the data themselves).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
MATLAB CODE<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
1. The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
2. Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the <br />
inner product. <math>U^T *X </math> is a (d x n) matrix.<br />
3. When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the <br />
dimensions that have lower variance (e.g. noise)<br />
4. We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
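<br />
The notes above can be sketched in Matlab as follows. This is an illustration under stated assumptions rather than the algorithm from the slide: the data are synthetic, the principal directions are obtained from an SVD of the centered data, and the names (Ud, Y, Xhat, relErr) are made up here. Each column of X is one data point, as in the image example.<br />
<br />
 rng(2);                                          % fix the seed so the run is repeatable<br />
 D = 5; n = 200; d = 2;                           % original dimension, sample size, reduced dimension<br />
 X = randn(D,d) * randn(d,n) + 0.05*randn(D,n);   % hypothetical data close to a d-dimensional subspace<br />
 mu = mean(X, 2);                                 % D x 1 mean of the data<br />
 Xc = X - repmat(mu, 1, n);                       % note 1: subtract the mean<br />
 [U, S, V] = svd(Xc, 'econ');                     % columns of U are the principal directions<br />
 Ud = U(:, 1:d);                                  % keep the top d directions<br />
 Y = Ud' * Xc;                                    % note 2: encode, Y is d x n<br />
 Xhat = Ud * Y + repmat(mu, 1, n);                % note 3: reconstruct from the top d dimensions<br />
 relErr = norm(X - Xhat, 'fro') / norm(X, 'fro')  % small, since the data are nearly d-dimensional<br />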
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
Main Contribution not complete<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes. For a 2 class problem, we want to reduce the data to one dimension (a line). Generally, for a k class problem, we want k-1 dimensions.<br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, in the ideal situation the individual classes are as far away from each other as possible, while the points within each class are as close together as possible (i.e. each class collapses to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
<b> Goal </b><br />
<br />
<b>1. Minimize the within class variance</b><br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br />
<br />
<b>2. Maximize the distance between the means of the projected data</b><br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math></div>Hclamhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3222stat341 / CM 3612009-07-18T20:38:33Z<p>Hclam: /* Lagrange Multiplier */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges, because a computer follows a deterministic procedure. What we can do is generate a sequence of numbers deterministically in such a way that the distribution of the numbers appears to be uniform (pseudo-random numbers). Outside a computational setting, sampling from a uniform distribution is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The general recurrence above, with a nonzero additive term ''b'', is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term; with ''b'' = 0, as in the example, it is purely multiplicative. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1 (or between 1 and ''m'' - 1 when ''b'' = 0, since 0 would map to itself). Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and those that do not are unsuitable for generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
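<br />
Both worked examples above can be reproduced with a short Matlab sketch. The variable names and the number of generated values are arbitrary choices made here; the seed <math>x_0</math> is stored at index 1 because Matlab indices start at 1.<br />
<br />
 a = 13; b = 0; m = 31;               % parameters of the first example above<br />
 n = 10;                              % how many numbers to generate (arbitrary)<br />
 x = zeros(1, n+1);<br />
 x(1) = 1;                            % seed x_0<br />
 for k = 1:n<br />
     x(k+1) = mod(a*x(k) + b, m);     % next value = (a * current value + b) mod m<br />
 end<br />
 disp(x(2:4))                         % 13 14 27, i.e. x_1, x_2, x_3 above<br />
 u = x(2:end) / (m - 1);              % scaled to [0, 1] as described above<br />
<br />
Changing the first line to a = 3; b = 2; m = 4; reproduces the second example, where every term equals 1.<br />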
<br />
Ideally, we want m to be a very large prime number, <math>x_{n}\not= 0</math> for any n, and the period to be equal to m-1. In practice, a paper published in 1988 by Park and Miller found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum value of a signed 32-bit integer) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is passed through the inverse of the cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, strictly increasing, and is the cumulative distribution function (cdf) of some distribution, then the cdf of the random variable X is F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \theta e^{-\theta x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math><br />
<br />
Solving <math>y = 1 - e^{-\theta x}</math> for <math>x</math> gives<br />
<br />
:<math> F^{-1}(y) = \frac{-\log(1-y)}{\theta} </math><br />
<br />
Since <math>1-U</math> is also <math>Unif[0,1]</math> whenever <math>U</math> is, we may equivalently use <math>\frac{-\log(u)}{\theta}</math>. Now we can generate our random sample <math>x_1, \dots, x_n</math> from <math>f(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\theta}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows the exponential pattern, as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math>, and for many distributions it is difficult (or impossible) to find the inverse of <math>F(x)</math> in closed form. Further, for some distributions it is not even possible to write <math>F(x)</math> itself in closed form.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (the upper bounds are written with less-than-or-equal-to signs, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U \leq p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 < U \leq p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} < U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
 p=[0.3,0.2,0.5];<br />
 x=zeros(1,1000);            % preallocate the sample<br />
 for i=1:1000<br />
   u=rand;<br />
   if u <= p(1)<br />
     x(i)=0;<br />
   elseif u <= sum(p(1:2))   % cumulative probability 0.3 + 0.2 = 0.5<br />
     x(i)=1;<br />
   else<br />
     x(i)=2;<br />
   end<br />
 end<br />
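<br />
The same idea extends to any finite discrete distribution by comparing <math>U</math> with the cumulative sums of the probabilities. The sketch below is one possible way to write it; the variable names, the seed and the sample size are choices made here, not part of the lecture.<br />
<br />
 rng(3);                                  % fix the seed so the run is repeatable<br />
 vals = [0 1 2];                          % support x_0, x_1, x_2<br />
 p    = [0.3 0.2 0.5];                    % probabilities p_0, p_1, p_2<br />
 cdf  = cumsum(p);                        % cumulative probabilities<br />
 cdf(end) = 1;                            % guard against rounding in the last entry<br />
 n = 100000;<br />
 u = rand(n,1);<br />
 x = zeros(n,1);<br />
 for i = 1:n<br />
     k = find(u(i) <= cdf, 1, 'first');   % smallest k with U <= p_0 + ... + p_k<br />
     x(i) = vals(k);<br />
 end<br />
 freq = [mean(x==0), mean(x==1), mean(x==2)]   % should be close to p<br />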
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c</math> such that <math>c \cdot g(x) \geq f(x)</math> for all <math>x</math>, we can draw samples from <math>g(x)</math> and accept each one with probability <math> \frac {f(x)}{c \cdot g(x)} </math>; the accepted samples follow the target distribution <math>f(x)</math>. A sample is likely to be accepted where this ratio is close to 1 and likely to be rejected where the ratio is close to 0.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x=7, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is close to 1, so a point sampled there is almost always accepted and follows the target distribution <math>f(x)</math>.<br />
<br />
At x=9, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is smaller, so a point sampled there is rejected with probability one minus this ratio.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x) </math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
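<br />
One consequence of <math> Pr(accept) = \frac{1}{c} </math>, added here as a side note: if the proposals are drawn independently, the number of proposals <math>N</math> needed to obtain one accepted sample follows a geometric distribution,<br />
<br />
<math> Pr(N = k) = \left(1 - \frac{1}{c}\right)^{k-1} \frac{1}{c}, \quad E[N] = c </math><br />
<br />
so on average <math>c</math> proposals are needed per accepted sample, and a smaller valid <math>c</math> means fewer rejections.<br />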
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that <math> c \cdot g(x) \geq f(x) </math> for all x<br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> f(x) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) = 1 </math>, the density of <math> Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> is 2. Choosing <math>c</math> equal to this maximum gives the smallest constant that satisfies <math> c \cdot g(x) \geq f(x) </math> for all x, which is what the Acceptance/Rejection Method requires.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
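 n = 1000;              % number of accepted samples wanted<br />
 x = zeros(1,n);<br />
 j = 0;                 % number of samples accepted so far<br />
 while j < n<br />
   y = rand;            % proposal Y ~ Unif[0,1]<br />
   u = rand;            % U ~ Unif[0,1]<br />
   if u <= y            % accept with probability f(y)/(c*g(y)) = y<br />
     j = j + 1;<br />
     x(j) = y;<br />
   end<br />
 end<br />
 hist(x);<br />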
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
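<br />
The steps of this discrete example can be sketched in Matlab as follows. The variable names, the seed and the number of samples are choices made here for illustration; the sketch also counts the proposals so that the observed acceptance rate can be compared with <math>\frac{1}{c}</math>.<br />
<br />
 rng(4);                                  % fix the seed so the run is repeatable<br />
 f = [0.15 0.55 0.20 0.10];               % target pmf on {1,2,3,4}<br />
 g = 0.25;                                % discrete uniform proposal pmf<br />
 c = max(f) / g;                          % 0.55/0.25 = 2.2<br />
 n = 50000;                               % accepted samples wanted (arbitrary)<br />
 x = zeros(1, n);<br />
 j = 0; trials = 0;<br />
 while j < n<br />
     y = randi(4);                        % Y uniform on {1,2,3,4}<br />
     u = rand;<br />
     trials = trials + 1;<br />
     if u <= f(y) / (c * g)               % accept with probability f(Y)/(c g(Y))<br />
         j = j + 1;<br />
         x(j) = y;<br />
     end<br />
 end<br />
 freq = [mean(x==1), mean(x==2), mean(x==3), mean(x==4)]   % should be close to f<br />
 acceptRate = n / trials                                   % should be close to 1/c<br />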
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a sample of the desired size from the target distribution is obtained. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math>, i.e. mean <math> 1/\lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
It appears that if we want to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add the results up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
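<br />
For a general <math>\lambda</math>, the product form in the last step of the procedure can be used directly. The sketch below uses illustrative values chosen here (<math>t = 5, \lambda = 2</math>, an arbitrary seed and sample size) and compares the sample mean and variance with <math>t/\lambda</math> and <math>t/\lambda^2</math>, the mean and variance of a Gamma distribution when <math>\lambda</math> is a rate, as the formula <math>x_i = -\frac{1}{\lambda}\log(u_i)</math> implies.<br />
<br />
 rng(5);                            % fix the seed so the run is repeatable<br />
 t = 5; lambda = 2; n = 100000;     % shape, rate, sample size (illustrative values)<br />
 U = rand(n, t);                    % t independent uniforms per sample<br />
 X = -log(prod(U, 2)) / lambda;     % X = -(1/lambda) * log(u_1 * ... * u_t)<br />
 disp([mean(X), t/lambda])          % sample mean vs. theoretical mean<br />
 disp([var(X), t/lambda^2])         % sample variance vs. theoretical variance<br />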
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far; e.g. the inverse of F(Z) has no closed form. We may be able to use the Acceptance-Rejection method, but there are better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is its distance from the pole (the origin), and θ is the angle its radius makes with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
Also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Since x and y are independent, their joint density is the product of the marginal densities,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
Changing variables to <math>d = r^2</math> and θ, it can be shown that the joint density of d and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> is the product of two density functions, an Exponential density in d and a Uniform density in <math>\theta</math>, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} d \sim~ Exp(1/2), \quad \theta \sim~ Unif[0,2\pi] \end{matrix} </math> (here 1/2 is the rate, so d has mean 2)<br />
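<br />
As a sketch of where this density comes from (a standard change of variables, written out here under the assumption that <math>d = r^2</math> and θ are used as the new coordinates): with <math>x = \sqrt{d}\cos(\theta)</math> and <math>y = \sqrt{d}\sin(\theta)</math>, the Jacobian determinant is<br />
:<math> \left| \frac{\partial(x,y)}{\partial(d,\theta)} \right| = \left| \frac{\cos(\theta)}{2\sqrt{d}}\sqrt{d}\cos(\theta) + \sqrt{d}\sin(\theta)\frac{\sin(\theta)}{2\sqrt{d}} \right| = \frac{1}{2} </math><br />
so that<br />
:<math> f(d,\theta) = f(x,y) \left| \frac{\partial(x,y)}{\partial(d,\theta)} \right| = \frac{1}{2\pi}e^{-\frac{d}{2}} * \frac{1}{2}, \quad d > 0, \; 0 \leq \theta < 2\pi </math><br />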
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y, from each pair of uniform draws.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
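u1 = rand(5000,1);             % u1, u2 ~ Unif(0,1), independent<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);                % d = r^2 ~ Exp(1/2) by the inverse transform<br />
theta = 2*pi*u2;               % theta ~ Unif[0, 2*pi]<br />
<br />
x = d.^(1/2).*cos(theta);      % transform back to Cartesian coordinates<br />
y = d.^(1/2).*sin(theta);<br />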
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \text{ i.i.d. } N(0,1) \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: Sample from a Multivariate Normal distribution with mean <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> (the variable u in the code) and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
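u = [-2; 3];                      % mean vector<br />
sigma = [ 1 1/2; 1/2 1];          % covariance matrix<br />
<br />
r = randn(15000,2);               % each row is a draw from N(0, I)<br />
ss = chol(sigma);                 % upper triangular factor, ss'*ss = sigma<br />
<br />
X = ones(15000,1)*u' + r*ss;      % each row of X is a draw from N(u, sigma)<br />
plot(X(:,1),X(:,2), '.');<br />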
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
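<br />
As a quick check (a minimal sketch, not part of the original notes), both factors reproduce <math>\Sigma</math>, which is why either one can be used. Note that chol returns an upper triangular factor R with R'*R = sigma, which matches the row-vector form r*ss used in the example, while sqrtm returns a symmetric factor S with S*S = sigma.<br />
<br />
 sigma = [1 1/2; 1/2 1];<br />
 R = chol(sigma);             % upper triangular Cholesky factor<br />
 S = sqrtm(sigma);            % symmetric matrix square root<br />
 errR = norm(R'*R - sigma)    % close to 0<br />
 errS = norm(S*S - sigma)     % close to 0<br />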
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion of the Frequentist and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>, a sample of N independent observations. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N -\frac{1}{2}log (2\pi) - log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta = \mu</math> is Gaussian with mean <math>\mu_0</math> and variance <math>\tau^2</math>, and treating <math>\sigma</math> as known:<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\tau})^2}</math><br />
By completing the square in <math>\mu</math> we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math> and <math>\tilde{\sigma}^2 = \left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1}</math><br />
The posterior mean <math>\tilde{\mu}</math> is a weighted average of the sample mean <math>\bar{x}</math> and the prior mean <math>\mu_0</math>, so in general it differs from the MLE.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
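<br />
The comparison between the two estimates can be sketched in Matlab as follows. All numerical values (mu_true, sigma, mu0, tau, N) are hypothetical and chosen only for illustration; the known-<math>\sigma</math> Gaussian model is assumed.<br />
<br />
 mu_true = 2; sigma = 1;                       % hypothetical data model N(mu_true, sigma^2)<br />
 mu0 = 0; tau = 0.5;                           % hypothetical prior N(mu0, tau^2)<br />
 N = 20;<br />
 x = mu_true + sigma*randn(N,1);               % simulated observations<br />
 xbar = mean(x)                                % MLE of mu<br />
 wgt = (N/sigma^2) / (N/sigma^2 + 1/tau^2);    % weight placed on the sample mean<br />
 mu_post = wgt*xbar + (1-wgt)*mu0              % posterior mean, lies between mu0 and xbar<br />
 var_post = 1/(N/sigma^2 + 1/tau^2)            % posterior variance<br />
<br />
Increasing N moves wgt toward 1, so the posterior mean approaches the sample mean, in line with the limit noted above.<br />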
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. The term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to simulating a random process, that is, a process whose future evolution is described by probability distributions (the counterpart of a deterministic process); such simulation methods are known as ''Monte Carlo methods'' [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)]</math>, which we estimate by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math>:<br />
# Define <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# Draw <math> \begin{matrix} x_1, x_2, ..., x_N \sim UNIF(a,b)\end{matrix}</math><br />
# Compute <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
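<br />
The process and the two statistics above can be sketched in Matlab as follows. The integrand, the limits and N are hypothetical choices made here so that the exact answer (9) is known; 1.96 is the usual normal quantile for a 95% interval.<br />
<br />
 a = 0; b = 3; N = 100000;<br />
 h = @(x) x.^2;                          % hypothetical integrand, exact integral is 9<br />
 x = a + (b-a)*rand(N,1);                % x_i ~ Unif(a,b)<br />
 Y = h(x)*(b-a);                         % Y_i = w(x_i)<br />
 Ihat = mean(Y)                          % Monte Carlo estimate of I<br />
 SE = std(Y)/sqrt(N)                     % std uses the N-1 denominator, as in s^2 above<br />
 CI = [Ihat - 1.96*SE, Ihat + 1.96*SE]   % approximate 95% confidence interval<br />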
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}, \; x \geq 0\end{matrix} </math><br />
<br />
# We need to draw <math>x_1, ..., x_N</math> from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math>, i.e. from Exp(1), for instance by the Inverse Transform Method<br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\sqrt{x_i}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u=rand(100,1);<br />
x=-log(u);                    % x ~ Exp(1) by the inverse transform<br />
h= x.^0.5;                    % h(x) = sqrt(x)<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value obtained by numerical integration in Matlab (the exact value is sqrt(pi)/2, about 0.8862)<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# Exact value, for comparison: <math> \displaystyle I = \left[ x^4/4 \right]_0^1 = 1/4 </math><br />
# <math>\displaystyle w(x) = x^3*(1-0)</math><br />
# <math> X_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);     % 1000 draws from Unif(0,1)<br />
mean(x.^3)            % elementwise cube, then average<br />
<br />
'''Example 3'''<br />
To estimate an integral over an infinite range,<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math>:<br />
# Substitute <math> y=\frac{1}{x-5+1} \Rightarrow dy=-\frac{1}{(x-4)^2}dx \Rightarrow dy=-y^2dx </math>, so that <math>x = \frac{1}{y}+4</math> and the limits <math>x = 5, x \to \infty</math> become <math>y = 1, y \to 0</math><br />
# <math> I = \displaystyle\int^0_1 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> Y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2})</math><br />
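<br />
Example 3 can be checked numerically with the sketch below (not from the lecture). It uses the fact that, once the limits of integration are put in increasing order, the integrand on (0,1) is the positive function <math>3e^{-(1/y+4)}/y^2</math>, and it compares the estimate with the closed form <math>3e^{-5}</math>. The sample size N is an arbitrary choice.<br />
<br />
 N = 100000;<br />
 y = rand(N,1);                      % y_i ~ Unif(0,1)<br />
 w = 3*exp(-(1./y + 4))./y.^2;       % integrand after the substitution x = 1/y + 4<br />
 Ihat = mean(w)                      % Monte Carlo estimate<br />
 Iexact = 3*exp(-5)                  % closed form, about 0.0202<br />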
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
By Bayes' rule, <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)*f(p_1,p_2)}{f(x,y)}</math>, where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be flat, <math>\displaystyle f(p_1,p_2) = 1</math>, and the posterior mean of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need the posterior distributions of <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of p given x successes in n trials is proportional to the Binomial likelihood <math>\displaystyle p^x(1-p)^{n-x}</math>, which is the kernel of a Beta density. Hence<br><br />
:<math>\displaystyle p_1 | X \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle p_2 | Y \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw N samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(N)}</math>;<br />
# Compute <math>\displaystyle \delta^{(i)} = p_1^{(i)} - p_2^{(i)}</math> in order to get N values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{i=1}^{N}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for Y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000); %1000 posterior draws of p1<br />
p2=betarnd(y+1, m-y+1, 1, 1000); %1000 posterior draws of p2<br />
delta=p1-p2;<br />
mean(delta) %estimate of the posterior mean (no semicolon, so the value is displayed)<br />
<br />
In one run of this code, the mean was 0.1938; the value varies slightly from run to run because the samples are random.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The interval is approximately <math> 95\%\ \text{CI} \approx (0.06606, 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
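<br />
The function betarnd requires the Statistics Toolbox. As a hedged alternative sketch (a swapped-in technique, not the lecture's code), when both Beta parameters are positive integers a Beta(a,b) draw can be formed as <math>G_1/(G_1+G_2)</math> with independent <math>G_1 \sim~ Gamma(a,1)</math> and <math>G_2 \sim~ Gamma(b,1)</math>, and each Gamma draw can be generated as a sum of exponentials as in the Gamma section above. The helper name gam is an assumption introduced for this sketch.<br />
<br />
 n = 100; x = 80; m = 100; y = 60;         % same data as in the example above<br />
 N = 1000;                                 % number of posterior draws<br />
 % Gamma(k,1) draws as sums of k Exp(1) variables (k must be a positive integer)<br />
 gam = @(k) sum(-log(rand(N,k)), 2);<br />
 g1 = gam(x+1);  g2 = gam(n-x+1);<br />
 p1 = g1 ./ (g1 + g2);                     % p1 ~ Beta(x+1, n-x+1)<br />
 g3 = gam(y+1);  g4 = gam(m-y+1);<br />
 p2 = g3 ./ (g3 + g4);                     % p2 ~ Beta(y+1, m-y+1)<br />
 delta = p1 - p2;<br />
 mean(delta)                               % should be near the analytic value 0.1961<br />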
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math><br><br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is a non-smooth function of the <math>\displaystyle p_i</math>, so the integral below has no convenient closed form<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
We assume a flat prior on each <math>\displaystyle p_i</math>, so that the posterior of each <math>\displaystyle p_i</math> is a Beta distribution, as in Example 1:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math> for <math>\displaystyle i = 1, ..., 10</math><br />
# Keep the sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard it and try again.<br />
# Compute <math>\displaystyle \delta</math> by finding the first dose <math>\displaystyle x_i</math> whose sampled <math>\displaystyle p_i</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, suppose that at each dose level <math>x_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
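<br />
The Monte Carlo process for this example can be sketched in Matlab as below. The dose levels and death counts are hypothetical, invented only for illustration, since the lecture's full data set is not given here. The Beta draws use the sum-of-exponentials construction of the Gamma distribution (an assumption that avoids betarnd), and draws in which no <math>p_i</math> reaches 0.5 are also discarded because <math>\delta</math> is undefined for them.<br />
<br />
 dose = 1:10;                          % hypothetical dose levels x_1 < ... < x_10<br />
 Y = [0 1 2 3 4 5 6 7 8 9];            % hypothetical number of deaths at each dose<br />
 n = 10;                               % rats per dose<br />
 M = 200000;                           % number of candidate draws<br />
 P = zeros(M,10);<br />
 for i = 1:10<br />
     % Beta(Y_i+1, n-Y_i+1) from two Gamma draws, each a sum of Exp(1) variables<br />
     g1 = sum(-log(rand(M, Y(i)+1)), 2);<br />
     g2 = sum(-log(rand(M, n-Y(i)+1)), 2);<br />
     P(:,i) = g1 ./ (g1 + g2);<br />
 end<br />
 keep = all(diff(P,1,2) >= 0, 2) & P(:,10) >= 0.5;   % ordered draws that reach 0.5<br />
 P = P(keep,:);<br />
 [~, j] = max(double(P >= 0.5), [], 2);              % index of the first p_i >= 0.5 in each row<br />
 delta = dose(j);<br />
 mean(delta)                           % Monte Carlo estimate of the posterior mean of delta<br />
<br />
Only a small fraction of the candidate draws satisfies the ordering constraint, which is why M is taken large.<br />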
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling helps estimate properties of a particular distribution using samples drawn from a different distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the random variables. The main difference is that importance sampling keeps every draw and weights it, reflecting the fact that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation of the integral of g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is the distribution we sample from and g(x) is the function to be integrated. Here we cast the integral <math>\int_{0}^1 g(x)dx</math> as an expectation with respect to U, so that <math>\int_{0}^1 g(x)dx= E(g(U))</math>. Hence we can approximate it by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>, where <math>u_1,\dots,u_n</math> are independent draws from Unif[0,1]. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability density function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x), where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the Law of Large Numbers, <math>\hat{I}</math> converges in probability to <math>I </math>, provided the <math>x_i</math> are independent draws from g and <math>E_g|w(x)|</math> is finite.<br />
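<br />
The process above can be sketched in Matlab as follows. The target f, the proposal g and the function h are hypothetical choices made here so that the exact answer is known (<math>E_f[X^2] = 1</math> for a standard normal f); in a real application f would be a density that is hard to sample from.<br />
<br />
 N = 100000;<br />
 x = 2*randn(N,1);                          % x_i ~ g = N(0, 2^2), a hypothetical proposal<br />
 f = @(x) exp(-x.^2/2)/sqrt(2*pi);          % target density N(0,1)<br />
 g = @(x) exp(-x.^2/8)/(2*sqrt(2*pi));      % proposal density N(0,4), thicker tail than f<br />
 h = @(x) x.^2;                             % E_f[h(X)] = 1<br />
 w = h(x).*f(x)./g(x);                      % w(x_i) = h(x_i) f(x_i) / g(x_i)<br />
 Ihat = mean(w)                             % should be close to 1<br />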
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> that are more probable under the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math>, which is the same as applying a regular Monte Carlo Simulation method with samples drawn from g and weighting each value <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>: draws that are more probable under f than under g (i.e. the ones for which <math>\displaystyle \frac{f(x)}{g(x)}</math> is high) count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of <math> h(x_i) </math> instead of a sum<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
In the last class we determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math>. We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(\beta(x)h(x))</math>, the expectation of <math>\displaystyle\beta(x)h(x)</math> with respect to g(x), where <math>\displaystyle \beta(x) = \frac{f(x)}{g(x)} </math> is a weight. Where <math>\displaystyle f(x) > g(x)</math>, the weight <math>\displaystyle\beta(x)</math> is greater than 1. Thus, the points with more weight are deemed more important, hence "importance sampling". With a well-chosen g, this can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math> while <math>h^2(x)f^2(x)</math> does not vanish as quickly, then <math>\displaystyle E[(w(x))^2]</math> can be infinite. This occurs if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>: the integrand <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can then become arbitrarily large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
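<br />
As a worked illustration of this remark (an example chosen here, not taken from the lecture), take <math>h(x) = 1</math>, let f be the <math>N(0,1)</math> density and let g be the <math>N(0,\sigma^2)</math> density. Then<br />
:<math> E_g[(w(x))^2] = \int \frac{f^2(x)}{g(x)}dx = \frac{\sigma}{\sqrt{2\pi}} \int e^{-\left(1-\frac{1}{2\sigma^2}\right)x^2}dx = \begin{cases} \frac{\sigma^2}{\sqrt{2\sigma^2-1}}, & \text{if } \sigma^2 > \frac{1}{2} \\ \infty, & \text{if } \sigma^2 \leq \frac{1}{2} \end{cases} </math><br />
so a proposal whose tail is too thin relative to f (here <math>\sigma^2 \leq 1/2</math>) gives an infinite second moment, while <math>\sigma = 1</math> (that is, g = f) gives exactly 1.<br />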
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It is important to take the absolute value of <math>\displaystyle h(x)</math>, since g is a density and cannot be negative. We can show analytically which <math>\displaystyle g</math> minimizes the variance (see the theorem below).<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) dx \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
:'''Aside: Jensen's Inequality'''<br />
:In this aside, <math>\displaystyle g </math> denotes a generic convex function, not the proposal density. If <math>\displaystyle g </math> is a convex function (for example, twice differentiable with <math>\displaystyle g''(x) \geq 0 </math>) then, for <math>\displaystyle 0 \leq \alpha \leq 1</math>, <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
:Essentially, the definition of convexity says that the line segment between two points on a curve lies on or above the curve. This can be generalized to a weighted average of any finite number of points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_i \geq 0</math> and <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
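Using Jensen's Inequality with the probabilities <math>\displaystyle p_1, ..., p_n</math> as the weights, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Applying this with the convex function <math>\displaystyle t \mapsto t^2</math> to the random variable <math>\displaystyle \left| w(x) \right|</math> gives<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />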
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
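Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g = g^* </math> into <math>\displaystyle E[(w(x))^2]</math>, the bound is attained: with <math>\displaystyle g^*</math>, <math>\displaystyle \left| w(x) \right| = \frac{\left| h(x) \right| f(x)}{g^*(x)} = \int \left| h(s) \right| f(s) ds</math> is constant in x, so <math>\displaystyle E[(w(x))^2] = \left(\int \left| h(s) \right| f(s) ds \right)^2</math>. Hence <math>\displaystyle g^*</math> minimizes the variance.<br /><br />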
<br />
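However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, because its normalizing constant <math>\displaystyle \int \left| h(s) \right| f(s) ds</math> is essentially the integral we are trying to estimate.<br />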
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
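<br />
The theorem can be illustrated with a case where <math>g^*</math> happens to be available. In the sketch below (a hypothetical example, not from the lecture), f is the Exp(1) density and <math>h(x) = x</math>, so <math>g^*(x) = xe^{-x}</math> is the Gamma(2,1) density, which can be sampled as a sum of two exponentials. Every weight <math>w(x_i)</math> then equals 1, so the estimator has zero variance.<br />
<br />
 N = 1000;<br />
 x = -log(rand(N,1).*rand(N,1));     % x_i ~ Gamma(2,1), the sum of two Exp(1) draws<br />
 f = @(x) exp(-x);                   % target density Exp(1)<br />
 h = @(x) x;                         % I = E_f[X] = 1<br />
 gstar = @(x) x.*exp(-x);            % optimal proposal |h| f / int |h| f, the Gamma(2,1) density<br />
 w = h(x).*f(x)./gstar(x);           % every weight equals 1<br />
 Ihat = mean(w)                      % equals 1 up to rounding<br />
 Vhat = var(w)                       % equals 0 up to rounding<br />
<br />
This works only because <math>\int \left| h(s) \right| f(s) ds = 1</math> is known in advance here, which is exactly the quantity being estimated.<br />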
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \ h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_1,...,x_N \sim~ g</math><br />
Since f is a density, <math>\displaystyle \int f(x) dx = 1</math>, so we can also write<br />
:: <math>I =\displaystyle \frac{\int\ h(x)f(x)\,dx}{\int f(x) dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int\ h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{j=1}^{N} b(x_j)}</math><br />
This form is especially useful when f is known only up to a proportionality constant, because the unknown constant cancels in <math>\displaystyle b'</math>. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
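<br />
The normalized-weight form can be sketched in Matlab as follows. The unnormalized target ftilde (a standard normal density without its constant), the proposal g and the function h are hypothetical choices made here so that the exact answer, 1, is known.<br />
<br />
 N = 100000;<br />
 x = 2*randn(N,1);                          % x_i ~ g = N(0, 2^2), a hypothetical proposal<br />
 ftilde = @(x) exp(-x.^2/2);                % N(0,1) density known only up to a constant<br />
 g = @(x) exp(-x.^2/8)/(2*sqrt(2*pi));      % proposal density N(0,4)<br />
 h = @(x) x.^2;                             % E_f[h(X)] = 1<br />
 b = ftilde(x)./g(x);                       % unnormalized weights<br />
 bprime = b/sum(b);                         % normalized weights, they sum to 1<br />
 Ihat = sum(bprime.*h(x))                   % should be close to 1<br />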
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math>, with density f:<br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the standard normal distribution and to compute the proportion of observations that are greater than or equal to 3.<br />
<br />
The variability of plain Monte Carlo will be high in this case, since this is a low probability event (a tail event) and different runs may give significantly different values. For example, with 1000 samples the first run may give only 3 occurrences (an estimated probability of .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we repeat the estimate 100 times, each time with 100 samples):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is the standard normal density and <math>g(x)</math> should be chosen so that it places most of its mass where <math>h(x)f(x)</math> is non-negligible, i.e. near and above 3.<br />
<br />
:Let <math>g</math> be the <math>N(4,1)</math> density, so that<br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
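N = 100;<br />
I = zeros(100,1);<br />
for j = 1:100<br />
    x = randn(N,1) + 4;       % x ~ g = N(4,1)<br />
    h = x >= 3;               % indicator of the tail event<br />
    b = exp(8-4*x);           % weight f(x)/g(x)<br />
    I(j) = sum(h.*b)/N;       % importance sampling estimate for this run<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />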
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
In one run, the above code gave <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math>. The mean is close to the true value 0.0013, and the variance is much smaller than the variance observed with plain Monte Carlo.<br />
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Index Set:''' <math>\displaystyle T </math> is the set from which t takes its values. It could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d. random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Index Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which, given the past, <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
In that case the joint density factorizes as<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{n+1}=j\mid X_n= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle p_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&0&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&q&0&p\\<br />
0&\dots&0&0&0&1<br />
\end{matrix}\right)</math><br />
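<br />
As a quick check of the structure described above (a step to the right with probability p, a step to the left with probability q, and stopping at either end), the following Matlab sketch builds the transition matrix for a small number of states. The values N = 5 and p = 0.3 are assumptions chosen purely for illustration.<br />
<br />
 N = 5;  p = 0.3;  q = 1 - p;     % illustrative values
 P = zeros(N);
 P(1,1) = 1;  P(N,N) = 1;         % first and last states are absorbing
 for i = 2:N-1
     P(i,i+1) = p;                % step to the right
     P(i,i-1) = q;                % step to the left
 end
 P
 sum(P,2)'                        % every row sums to 1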
<br />
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general idea is to generate a sequence of <math>x_i</math> whose distribution approaches <math>f(x)</math>, so that an estimator similar to the one used in importance sampling can be used to estimate an integral of the form<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in each row <math>i</math> sum to 1 over <math>j</math>: row <math>i</math> must itself be a pmf, since a transition from <math>i</math> must lead to a state still within the state space of <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The transition matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\frac{}{}\mu_n(i) = P(X_n=i)</math><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\frac{}{}\mu_{n-1}P=\mu_0P^n</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\frac{}{}\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\frac{}{}\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\frac{}{}\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>\frac{}{}=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\frac{}{}P(n)=P^n</math><br />
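<br />
The relation between the marginal distributions and the powers of P can be checked numerically. The sketch below uses a hypothetical two-state chain (the matrix P and the starting distribution mu0 are made-up values) and compares the step-by-step update <math>\mu_k=\mu_{k-1}P</math> with the closed form <math>\mu_0P^n</math>.<br />
<br />
 P   = [0.9 0.1; 0.4 0.6];        % hypothetical transition matrix, rows sum to 1
 mu0 = [1 0];                     % start in state 1 with probability 1
 n   = 10;
 mu  = mu0;
 for k = 1:n
     mu = mu*P;                   % mu_k = mu_{k-1} P
 end
 mu_closed = mu0*P^n;             % mu_n = mu_0 P^n
 max(abs(mu - mu_closed))         % agrees up to rounding error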
<br />
====Summary of definitions====<br />
*'''transition matrix''': an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> for <math>i,j \in X</math><br />
*'''n-step transition matrix''': also <math>N \times N</math>, with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br />
*'''marginal (probability of X)''': <math>\mu_n(i) = Pr(X_n=i)</math><br />
*'''Theorem:''' <math>P(n)=P^n</math><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
---<br />
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means that j is accessible from i: <math>j\longleftarrow i</math> <br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A chain is '''Irreducible''' if all states communicate with each other (i.e. it has only one equivalence class).<br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 can only move to each other or stay where they are: the only nonzero entries in rows 1 and 2 are <math>\displaystyle P_{11}</math>, <math>\displaystyle P_{12}</math>, <math>\displaystyle P_{21}</math> and <math>\displaystyle P_{22}</math><br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can go to every other state, but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state, since it is a closed class with only one element<br />
::<br />
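<br />
The classes can also be found mechanically from the transition matrix. The sketch below shows one possible approach, which is not part of the lecture: state j is reachable from state i if entry (i,j) of <math>(I+A)^{N-1}</math> is positive, where A marks the nonzero entries of P, and two states communicate if each one reaches the other. All variable names are illustrative.<br />
<br />
 P = [1/3 2/3 0 0; 2/3 1/3 0 0; 1/4 1/4 1/4 1/4; 0 0 0 1];
 N = size(P,1);
 A = double(P > 0);                   % one-step transitions that are possible
 R = (eye(N) + A)^(N-1) > 0;          % R(i,j) is true if i reaches j in at most N-1 steps
 C = R & R';                          % C(i,j) is true if i and j communicate
 classes = unique(double(C), 'rows')  % each distinct row marks the members of one class
<br />
Each row of the result is an indicator vector of one class, so the three rows correspond to the classes {4}, {3} and {1, 2}.<br />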
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i \text{ for some } n\geq 1 \mid x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If this is not the case (i.e. the probability is less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain must have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous property and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent then all states will be recurrent. On the other hand, if one of them is transient, then all the others will be transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. The chain can return to <math>\displaystyle 0</math> only after an even number of steps 2n, made up of n steps to the right and n steps to the left in any order.<br />
<math>\displaystyle Pr(X_{2n}=0 \mid X_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know whether this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
Since <math>\displaystyle P_{00}(n)=0</math> for odd n, the sum <math>\displaystyle \sum_{\forall n}P_{00}(n)</math> converges or diverges together with:<br />
<br />
<math>\displaystyle \sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> (i.e. <math>p\neq \frac{1}{2}</math>) the series converges and the state is transient; otherwise <math>\displaystyle 4pq = 1</math> (i.e. <math>p = \frac{1}{2}</math>), the series diverges and the state is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(n) \begin{cases}<br />
= \infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
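<br />
Both conclusions can be illustrated numerically. The sketch below uses the arbitrarily chosen values p = 0.6 and p = 0.5 and an arbitrary cut-off nmax; it accumulates the terms <math>P_{00}(2n) = \binom{2n}{n}p^nq^n</math> using the ratio between consecutive terms, <math>\frac{2(2n-1)}{n}pq</math>, which avoids computing large factorials.<br />
<br />
 nmax = 5000;                             % arbitrary number of terms
 for p = [0.6 0.5]
     q = 1 - p;
     term = 1;  S = 1;                    % n = 0 term: P_00(0) = 1
     for n = 1:nmax
         term = term*2*(2*n-1)/n*p*q;     % P_00(2n) obtained from P_00(2n-2)
         S = S + term;
     end
     fprintf('p = %.1f: partial sum = %.4f\n', p, S);
 end
 % For p = 0.6 the sum settles at 1/sqrt(1 - 4*0.6*0.4) = 5;
 % for p = 0.5 the partial sums keep growing, roughly like 2*sqrt(nmax/pi).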
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{ij}</math> as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i, in which case the time is infinite.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
min\{n\geq 1: x_{n}=j\}, & \text{if } x_{0}=i \text{ and such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> of state i is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j \mid x_{0}=i)</math> is the probability that the first visit to j, starting from i, happens at step n.<br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive.<br />
<br />
====Periodic and aperiodic Markov chain====<br />
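A state has period <math>\displaystyle d</math> if, starting from that state, a return to it is only possible after a number of steps that is a multiple of <math>\displaystyle d</math>, where <math>\displaystyle d</math> is the largest number with this property. A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle d</math> if its states have period <math>\displaystyle d > 1</math>.<br />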
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math>.<br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition. Starting from state 1, you can only get back to state 1 after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
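<br />
The period of a state can be computed as the greatest common divisor of all step counts n at which a return is possible, i.e. <math>P_{ii}(n) > 0</math>. This is the standard textbook definition rather than the wording used in the lecture. The sketch below applies it to state 1 of the two example chains above, scanning n only up to an arbitrary cut-off nmax; the variable names are illustrative.<br />
<br />
 Ps = {[0 1 0; 0 0 1; 1 0 0], ...
       [0 .5 0 .5; .5 0 .5 0; 0 .5 0 .5; .5 0 .5 0]};
 nmax = 20;                          % arbitrary cut-off for the scan
 d = zeros(1, numel(Ps));
 for k = 1:numel(Ps)
     P = Ps{k};  Pn = eye(size(P,1));
     for n = 1:nmax
         Pn = Pn*P;                  % Pn now holds P^n
         if Pn(1,1) > 0              % a return to state 1 is possible at step n
             d(k) = gcd(d(k), n);    % gcd(0,n) = n starts the running gcd
         end
     end
 end
 d                                   % period of state 1 in each chain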
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i \geq 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof.'''<br><br />
We need to show <math>\pi = \pi P</math><br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
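<br />
A short numerical check makes the failure visible: for this matrix <math>P^3</math> is the identity, so the powers of P cycle with period 3 and never settle, even though (1/3, 1/3, 1/3) is left unchanged by P. The variable name pi_s below is illustrative.<br />
<br />
 P    = [0 1 0; 0 0 1; 1 0 0];
 pi_s = [1 1 1]/3;
 pi_s*P                 % equals pi_s, so pi_s is stationary
 P^3                    % identity matrix: P^n cycles through P, P^2, I
 [1 0 0]*P^10           % a chain started in state 1 is in state 2 at n = 10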
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, if the stationary distribution of a three-state chain is [0.5 0 0.5], then in the long run the chain spends half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N\to\infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n)= E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
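<br />
The same answer can be obtained numerically. The sketch below is one way to do it in Matlab (the names A, b, pi_hat and P100 are illustrative): it stacks the equations <math>\pi(P-I)=0</math> with the normalization constraint, solves the resulting linear system, and then compares the solution with a row of a high power of P.<br />
<br />
 P = [1/2 1/2 0; 1/2 1/4 1/4; 0 1/3 2/3];
 A = [P' - eye(3); ones(1,3)];    % (P' - I)*pi' = 0 together with sum(pi) = 1
 b = [0; 0; 0; 1];
 pi_hat = (A\b)'                  % least-squares solve of the consistent system
 P100 = P^100;
 P100(1,:)                        % rows of P^n approach the same vector
 [4 4 3]/11                       % value derived by hand above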
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>. This is the starting point of the chain; choose it randomly and set the index <math>\displaystyle i=0</math><br />
:<br\>2. Draw <math>Y~ \sim~ q(y\mid{X_i})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. Draw <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update the index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>. The starting point of the algorithm, <math>X_0</math>, is chosen randomly.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This is called the Metropolis Algorithm; it is the original algorithm of Metropolis (1953) and a special case of the Metropolis-Hastings algorithm.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric. The probability density of setting the mean to x and sampling y is equal to that of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the fraction <math>\frac{f(y)}{f(x)}</math> would be very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>, so the probability of rejecting <math>\displaystyle Y</math> is high and the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution.<br />
<br />
::This is easy to observe by looking at the histogram of <math>\displaystyle X</math>: the shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We have talked about <math>\emph{discrete}</math> Markov chains so far. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
That is, we want to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is a stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> by detailed balance<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
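<br />
The argument can be checked numerically on a small discrete example. The sketch below is an illustration under assumed values: a made-up target pmf f on three states and a made-up proposal matrix Q with Q(i,j) = q(j|i). It builds the Metropolis-Hastings transition matrix P(i,j) = q(j|i) r(i,j) for i not equal to j, puts the remaining (rejection) probability on the diagonal, and then verifies detailed balance and stationarity.<br />
<br />
 f = [0.2 0.5 0.3];                               % hypothetical target pmf
 Q = [0.1 0.6 0.3; 0.3 0.2 0.5; 0.4 0.4 0.2];     % hypothetical proposal, rows sum to 1
 n = numel(f);
 P = zeros(n);
 for i = 1:n
     for j = 1:n
         if i ~= j
             r = min(f(j)*Q(j,i)/(f(i)*Q(i,j)), 1);   % acceptance probability r(i,j)
             P(i,j) = Q(i,j)*r;                       % propose j and accept it
         end
     end
     P(i,i) = 1 - sum(P(i,:));                    % probability of staying in state i
 end
 F = diag(f)*P;                                   % F(i,j) = f(i) P(i,j)
 max(max(abs(F - F')))                            % zero up to rounding: detailed balance
 max(abs(f*P - f))                                % zero up to rounding: f is stationary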
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Annealing method (that is introduced in this lecture as well). The Gibbs sampling algorithm, that will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. Gibbs sampling is a more efficient method, although less applicable at times.<br />
<br />
In the last class, we showed that Metropolis Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>, i.e. <math>\epsilon = Y-X_i \sim~ g </math>. Here <math>\displaystyle X_i </math> is the current state and <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. <math>Y=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In MATLAB, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e. <math>\displaystyle Y = X_i + randn*b </math>)<br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
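<br />
The procedure above can be sketched in MATLAB as follows. This is only a minimal illustration under stated assumptions: the target is the standard Cauchy density from the earlier example, and the values of <math>\displaystyle b </math> and the chain length <code>n</code> are arbitrary choices, not values from the lecture.<br />

```matlab
% Random walk Metropolis-Hastings sketch (illustrative, not the lecture's code).
f = @(x) 1 ./ (pi * (1 + x.^2));   % target density: standard Cauchy
b = 2;                             % proposal standard deviation (assumed value)
n = 10000;                         % number of iterations (assumed value)
X = zeros(n, 1);                   % chain, started at X(1) = 0
accepted = 0;                      % number of accepted proposals
for i = 2:n
    Y = X(i-1) + b*randn;              % Y = X_i + epsilon, epsilon ~ N(0, b^2)
    r = min(1, f(Y) / f(X(i-1)));      % q is symmetric, so it cancels in the ratio
    if rand < r
        X(i) = Y;                      % accept the proposal
        accepted = accepted + 1;
    else
        X(i) = X(i-1);                 % reject: stay at the current state
    end
end
acceptRate = accepted / (n - 1);   % fraction of proposals that were accepted
```

Comparing <code>acceptRate</code> with the 0.5 rule of thumb gives a rough way to tune <math>\displaystyle b </math>: a larger <math>\displaystyle b </math> lowers the acceptance rate, a smaller one raises it.<br />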
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
and define <math>r = \min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math>. It is usually used when <math>\displaystyle f </math> is known up to a proportionality constant (the form of the distribution is known, but the distribution is not exactly known).<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability } r \\<br />
x_i, & \text{with probability } 1-r \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
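<br />
A minimal MATLAB sketch of the independence sampler is given below. The target and proposal are assumptions chosen for illustration: <math>\displaystyle f </math> is a standard normal known only up to its constant, and <math>\displaystyle q </math> is a standard Cauchy, whose heavier tails cover the target.<br />

```matlab
% Independence Metropolis-Hastings sketch (illustrative choices of f and q).
f = @(x) exp(-x.^2 / 2);           % target, known only up to a constant
q = @(x) 1 ./ (pi * (1 + x.^2));   % fixed proposal density: standard Cauchy
n = 10000;                         % number of iterations (assumed value)
X = zeros(n, 1);                   % chain, started at X(1) = 0
for i = 2:n
    Y = tan(pi * (rand - 0.5));    % draw Y ~ q by inverting the Cauchy CDF
    % the proposal does not depend on X(i-1), but the acceptance ratio does
    r = min(1, (f(Y) / f(X(i-1))) * (q(X(i-1)) / q(Y)));
    if rand < r
        X(i) = Y;                  % accept
    else
        X(i) = X(i-1);             % reject
    end
end
```

Because rejected proposals repeat the previous state, consecutive entries of <code>X</code> are dependent even though every <code>Y</code> is drawn independently.<br />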
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find the x that minimizes <math>\displaystyle h(x) </math>. This is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant <math>\displaystyle T > 0 </math> (since the exponential function is monotone increasing, the minimizer of <math>\displaystyle h </math> is the maximizer of <math>\displaystyle e^{-h/T} </math>).<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Set <math>\displaystyle i = i + 1 </math>, decrease T and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
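<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math>. Since T is large, <math>\displaystyle r </math> is close to 1, so moves to worse points are still accepted often and the chain explores the space freely.<br />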
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
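<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} \rightarrow 0 </math> and we therefore reject <math>\displaystyle y </math> almost surely, so the chain essentially only moves to better points.<br />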
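<br />
The effect of the temperature on the acceptance probability can be checked numerically. The short MATLAB sketch below assumes an uphill proposal with <math>\displaystyle h(y) - h(x) = 1 </math>; the list of temperatures is arbitrary.<br />

```matlab
% Acceptance probability of an uphill move at several temperatures.
dh = 1;                        % h(y) - h(x) > 0: the proposal is worse by 1 (assumed)
T  = [100 10 1 0.1 0.01];      % illustrative temperatures
r  = min(1, exp(-dh ./ T));    % r = min(1, e^{(h(x)-h(y))/T}) for each temperature
disp([T; r]);                  % first row: T, second row: acceptance probability
```

At <math>\displaystyle T = 100 </math> the move is accepted about 99% of the time, whereas at <math>\displaystyle T = 0.01 </math> the acceptance probability is numerically indistinguishable from zero.<br />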
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
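We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100 </math> and <math>\displaystyle T = 0.1</math> (here <math>\displaystyle f </math> plays the role of the objective <math>\displaystyle h </math> above) and observe that the distribution expands for large T and contracts for small T, i.e. T plays a role similar to a variance, making the distribution expand and contract accordingly.<br />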
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
By the end, T is quite small, the distribution that we are sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
 obj = @(x) (x - 2).^2; %objective function to be minimized<br />
 <br />
 T = 10; %this is the initial value of T, which we must gradually decrease<br />
 b = 2; %standard deviation of the proposal distribution<br />
 X(1) = 0; %starting point<br />
 for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
     Y = b*randn + X(i-1); %propose Y ~ N(X(i-1), b^2)<br />
     %f(Y)/f(X) = exp((obj(X) - obj(Y))/T); this form avoids 0/0 when T is small<br />
     r = min(1 , exp((obj(X(i-1)) - obj(Y))/T) );<br />
     U = rand;<br />
     if U < r<br />
         X(i) = Y; %accept Y<br />
     else<br />
         X(i) = X(i-1); %reject Y<br />
     end;<br />
 end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. the chain converges to the minimizer <math>\displaystyle x = 2 </math> of the function.<br />
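<br />
The whole cooling schedule can also be run in a single script. The following is a sketch rather than the code used in the lecture: it reuses the temperatures listed above and assumes 100 iterations per temperature.<br />

```matlab
% Simulated annealing for h(x) = (x - 2)^2 over a full cooling schedule (sketch).
h  = @(x) (x - 2).^2;                          % objective to minimize
Ts = [10 9 5 2 0.9 0.1 0.01 0.005 0.001];      % temperatures from the example
b  = 2;                                        % proposal standard deviation
nPerT = 100;                                   % iterations per temperature (assumed)
x  = 0;                                        % starting point
path = zeros(numel(Ts) * nPerT, 1);            % records the state after every iteration
k = 0;
for T = Ts
    for i = 1:nPerT
        y = x + b*randn;                       % symmetric random walk proposal
        r = min(1, exp((h(x) - h(y)) / T));    % acceptance probability at temperature T
        if rand < r
            x = y;                             % accept the proposal
        end
        k = k + 1;
        path(k) = x;
    end
end
% plot(path) shows the chain settling near x = 2 as T decreases
```

Writing the ratio as <code>exp((h(x) - h(y))/T)</code> rather than a quotient of two exponentials keeps the computation well defined at very small temperatures.<br />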
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, computer science, chemistry, physics, psychology, etc. Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience, to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
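<br />
One way simulated annealing could be applied to this problem is sketched below in MATLAB. Everything specific here is an assumption made for illustration: the city coordinates are random, the proposal reverses a randomly chosen segment of the tour (a symmetric move), and the cooling schedule is arbitrary.<br />

```matlab
% Simulated annealing sketch for a small travelling salesman problem.
nCities = 10;
xy = rand(nCities, 2);                         % made-up city coordinates
% length of the closed tour visiting the cities in the order given by row vector p
tourLen = @(p) sum(sqrt(sum((xy(p, :) - xy([p(2:end) p(1)], :)).^2, 2)));
tour = 1:nCities;                              % initial tour
startLen = tourLen(tour);
best = startLen;                               % shortest tour length seen so far
for T = [1 0.5 0.1 0.05 0.01 0.001]            % cooling schedule (assumed)
    for i = 1:500
        idx = sort(randperm(nCities, 2));      % two distinct positions in the tour
        cand = tour;
        cand(idx(1):idx(2)) = tour(idx(2):-1:idx(1));   % reverse the segment between them
        r = min(1, exp((tourLen(tour) - tourLen(cand)) / T));
        if rand < r
            tour = cand;                       % accept the candidate tour
        end
        best = min(best, tourLen(tour));
    end
end
```

Here the tour length plays the role of <math>\displaystyle h </math>, so shorter tours are always accepted and longer ones are accepted with a probability that shrinks as T decreases.<br />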
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the proof given for Metropolis-Hastings.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when there are relatively large correlations between the random variables, since the chain then moves through the space very slowly.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y|X}(y|x_i) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X|Y}(x|y_{i+1})</math> (note: we must condition on <math>\displaystyle Y_{i+1}</math>, not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note:</b> This method usually takes a long time to converge; this initial period is called the "burn-in" time (or "burning time"). For this reason the distribution is sampled better by using only the later <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
 m = [1 ; 2]; %mean vector of the target distribution<br />
 rho = 0.01;<br />
 sigma = sqrt(1 - rho^2); %standard deviation of each conditional distribution<br />
 X(1,:) = [0 0];<br />
 <br />
 for i = 2:5000<br />
     mu = m(1) + rho*(X(i-1,2) - m(2)); %conditional mean of x1 given the previous x2<br />
     X(i,1) = mu + sigma*randn;<br />
     mu = m(2) + rho*(X(i,1) - m(1)); %conditional mean of x2 given the new x1<br />
     X(i,2) = mu + sigma*randn;<br />
 end;<br />
 %plot(X(:,1),X(:,2),'.') plots all of the points<br />
 %plot(X(1000:end,1),X(1000:end,2),'.') plots points 1000 to 5000 only -><br />
 %this demonstrates that convergence occurs after a while<br />
 %(this is called the burn-in, or burning, time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some initial values <math>\displaystyle X_0, Y_0</math> and set <math>\displaystyle n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
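<br />
The steps above can be sketched in MATLAB as follows. The target density is a made-up example known only up to a constant, and both proposals are assumed to be symmetric random walks, so the ratios involving <math>\displaystyle q </math> and <math>\displaystyle \tilde{q} </math> equal 1.<br />

```matlab
% Metropolis-Hastings within Gibbs sketch (illustrative target and proposals).
f = @(x, y) exp(-(x.^2 + y.^2 + x.^2 .* y.^2) / 2);   % unnormalized target (made up)
b = 1;                                   % random walk step size (assumed value)
n = 5000;                                % number of sweeps (assumed value)
X = zeros(n, 1);  Y = zeros(n, 1);       % chains, started at X(1) = Y(1) = 0
for k = 2:n
    % Metropolis-Hastings step for X, with Y fixed at Y(k-1)
    Z = X(k-1) + b*randn;
    r = min(1, f(Z, Y(k-1)) / f(X(k-1), Y(k-1)));
    if rand < r, X(k) = Z; else, X(k) = X(k-1); end
    % Metropolis-Hastings step for Y, with X fixed at the new value X(k)
    Z = Y(k-1) + b*randn;
    r = min(1, f(X(k), Z) / f(X(k), Y(k-1)));
    if rand < r, Y(k) = Z; else, Y(k) = Y(k-1); end
end
```

Note that the second step conditions on the freshly updated <code>X(k)</code>, mirroring the use of <math>\displaystyle X_{n+1} </math> in the procedure.<br />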
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
PageRank is a link analysis algorithm named after Larry Page, a computer scientist and one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However, the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, notably Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, we may fix the scale of <math>\displaystyle P</math>. Summing the equations for all <math>\displaystyle P_i</math> shows that, when <math>\displaystyle d < 1</math> and every page has at least one outgoing link (so that each column of <math>\displaystyle L \times D^{-1}</math> sums to 1), the ranks satisfy<br />
<br />
<math>\sum_{i=1}^N P_i = N</math> <br />
<br />
i.e. the average rank is 1 (in matrix form this is <math>\displaystyle e^T \times P = N</math>). Using this, we can solve for P:<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times \frac{e^T \times P}{N} + d \times L \times D^{-1} \times P \text{ by replacing 1 with } \frac{e^T \times P}{N} </math> <br />
<br />
<math>\displaystyle P = [\frac{(1-d)}{N} \times e \times e^T + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the iterates converge to the stationary vector, for large n we have <math>\displaystyle P_n \approx P</math>.<br />
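<br />
As a rough illustration, the MATLAB sketch below iterates the matrix form <math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math> directly on a hypothetical web of four pages. The link matrix and the value <math>\displaystyle d = 0.85 </math> are assumptions made for this example only.<br />

```matlab
% PageRank by repeated substitution on a made-up 4-page web (sketch).
% L(i,j) = 1 if page j links to page i
L = [0 0 1 1;
     1 0 0 0;
     1 1 0 1;
     0 1 1 0];
N = size(L, 1);
d = 0.85;                       % weighting coefficient (assumed value)
c = sum(L, 1);                  % c(j) = number of outgoing links from page j
Dinv = diag(1 ./ c);            % requires every page to have at least one outgoing link
e = ones(N, 1);
P = e;                          % initial guess
for k = 1:100
    P = (1 - d)*e + d * L * Dinv * P;   % apply the matrix form repeatedly
end
residual = norm(P - ((1 - d)*e + d * L * Dinv * P));   % close to 0 at the fixed point
[~, order] = sort(P, 'descend');        % pages from most to least important
```

Since rank is relative, <code>P / sum(P)</code> gives the same ordering of pages with ranks rescaled to sum to one.<br />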
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math> , then the angle <math>\displaystyle \theta </math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
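<br />
These definitions can be checked on a small example. The MATLAB sketch below uses two arbitrary vectors in <math>\mathbf{R}^3</math> chosen only for illustration.<br />

```matlab
% Inner product, length, distance and angle for two example vectors.
u = [1; 2; 2];
v = [2; 0; 1];
ip    = u' * v;                          % inner product: 1*2 + 2*0 + 2*1 = 4
normU = sqrt(u' * u);                    % length of u: sqrt(1 + 4 + 4) = 3
dUV   = norm(u - v);                     % distance: ||u - v|| = sqrt(1 + 4 + 1) = sqrt(6)
theta = acos(ip / (norm(u) * norm(v)));  % angle, from u.v = ||u|| ||v|| cos(theta)
```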
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthogonal matrix is an orthonormal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that the trace also satisfies:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix (counted with multiplicity).<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Alternatively, a matrix is non-singular if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular. When A has linearly independent columns (so that <math>A^TA</math> is invertible), it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math>, with <math>A^{\dagger}A = I</math>.<br />
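<br />
The formula can be verified numerically. The MATLAB sketch below uses an arbitrary <math>3 \times 2</math> matrix with linearly independent columns (an assumption needed for <math>A^TA</math> to be invertible) and compares the result with MATLAB's built-in <code>pinv</code>.<br />

```matlab
% Pseudo-inverse of a non-square matrix with linearly independent columns.
A = [1 0; 0 1; 1 1];            % 3 x 2 example matrix (made up)
Adag = (A' * A) \ A';           % (A^T A)^{-1} A^T, computed without an explicit inverse
err1 = norm(Adag * A - eye(2)); % checks A^dagger A = I
err2 = norm(Adag - pinv(A));    % checks agreement with the built-in pinv
```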
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given a matrix <math>A_{N \times N}</math>, we say that a non-zero vector v is an ''eigenvector'' if there exists a scalar <math>\lambda</math> such that <math>Av = \lambda v</math>, where <math>\lambda</math> is the corresponding eigenvalue.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
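<br />
The properties above can be checked on a small example. The MATLAB sketch below uses an arbitrary real, symmetric, positive-definite matrix chosen for illustration.<br />

```matlab
% Eigenvalues and eigenvectors of a small symmetric matrix.
A = [2 1; 1 2];                       % real, symmetric, positive-definite example
[V, D] = eig(A);                      % columns of V: eigenvectors, diag(D): eigenvalues
lambda = diag(D);                     % the eigenvalues are 1 and 3 (real and positive)
resid  = norm(A*V - V*D);             % checks A v = lambda v for each pair
orthErr = norm(V' * V - eye(2));      % the eigenvectors returned are orthonormal
traceErr = abs(trace(A) - sum(lambda));   % trace equals the sum of the eigenvalues
```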
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> to a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
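<br />
The decorrelating effect of the eigenvector transformation can be seen in a short MATLAB sketch. The covariance matrix and the sample size below are made-up values used only for illustration.<br />

```matlab
% Transforming correlated data by the eigenvectors of its covariance matrix.
n = 5000;                               % number of samples (assumed value)
Sigma = [2 1.2; 1.2 1];                 % covariance matrix (made-up values)
X = randn(n, 2) * chol(Sigma);          % rows are samples with covariance approx. Sigma
S = cov(X);                             % sample covariance matrix
[V, D] = eig(S);                        % principal directions and their variances
Z = X * V;                              % coordinates of the samples in the eigenvector basis
C = cov(Z);                             % diagonal up to rounding: the coordinates are uncorrelated
```

The diagonal of <code>C</code> matches the eigenvalues in <code>D</code>, i.e. the variances along the principal directions.<br />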
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
Given a high-dimensional sample of vectors, applying PCA produces an orthogonal set of vectors (called principal components) such that the first principal component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
If we ignore the last few principal components (directions with the smallest variance) then we can approximate the data by a lower-dimensional subspace, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA; others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]; PCA itself is linear).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of PCA===<br />
We want to find the direction of maximum variation. Take a direction <math>\textbf{w} = [w_1, \ldots, w_D]^T</math> and a data point <math>\textbf{x} = [x_1, \ldots, x_D]^T</math>, then compute the length of the projection of the point onto that direction:<br />
<br />
<math><br />
u = \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
</math><br />
<br />
Of course, the direction <math>\textbf{w}</math> is the same as <math>2\textbf{w}</math> or in general <math>c\textbf{w}</math>, and it doesn't matter which one we use. So without loss of generality, let the length of <math>\textbf{w}</math> be 1. Therefore <math>\textbf{w}^T \textbf{w} = 1</math> so the equation simplifies to just<br />
<br />
<math><br />
u = \textbf{w}^T \textbf{x}.<br />
</math><br />
<br />
Let <math>x_1, \ldots, x_D</math> be random variables. Our goal is to maximize the variance of <math>u</math>, which is<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br />
where <math>\Sigma</math> is the covariance matrix. For a finite data set we can replace <math>\Sigma</math> by <math>s</math>, the sample covariance matrix.<br />
<br />
So <math>\displaystyle \textbf{w}^T s \textbf{w}</math> is the sample variance of the projection <math>\displaystyle u = \textbf{w}^T \textbf{x}</math> for any unit-length weight vector <math>\displaystyle \textbf{w}</math>.<br />
<br />
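The identity between the variance of the projection and the quadratic form in the sample covariance matrix can be checked numerically. This is a small sketch in plain Python; the data matrix <code>X</code> and the direction <code>raw</code> are hypothetical values chosen only for illustration.<br />
<br />
```python
import math

# Hypothetical 3-D data set: one observation per row.
X = [
    [2.0, 0.5, 1.0],
    [1.0, 1.5, 0.0],
    [3.0, 2.5, 2.0],
    [4.0, 1.0, 1.5],
    [0.0, 0.5, 0.5],
]

def sample_covariance(rows):
    """Sample covariance matrix s (divides by n - 1)."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def sample_variance(values):
    """Sample variance of a list of scalars (divides by n - 1)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

# An arbitrary direction, rescaled to unit length so that w^T w = 1.
raw = [1.0, 2.0, -1.0]
norm = math.sqrt(sum(c * c for c in raw))
w = [c / norm for c in raw]

s = sample_covariance(X)
u = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]  # u = w^T x per observation

direct = sample_variance(u)                                                  # var(u)
quadratic = sum(w[i] * s[i][j] * w[j] for i in range(3) for j in range(3))   # w^T s w
print(direct, quadratic)  # the two numbers agree up to rounding
```
<br />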
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
Since <math>\| \textbf{w} \| = 1</math>, the objective is bounded above:<br />
<br />
<math><br />
\textbf{w}^T s \textbf{w} \leq | \textbf{w}^T s \textbf{w} | \leq \| s \| \, \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br />
where <math>\| s \|</math> is the operator norm of <math>s</math>. Therefore the variance is bounded, so the maximum exists. Our goal is to find the weights <math>\displaystyle \textbf{w}</math> that maximize this variability, subject to the unit-length constraint. The problem then becomes:<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br />
<br />
Next lecture we will actually find the maximum.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a Lagrange Multiplier and we form the Lagrangian L:<br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br />
If <math>\displaystyle (x^*,y^*)</math> is the maximum of <math>\displaystyle f(x,y)</math> subject to <math>\displaystyle g(x,y) = 0</math>, then there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of L (all partial derivatives are 0).<br />
<br>In addition, <math>\displaystyle (x^*,y^*)</math> is a point at which a level curve of <math>\displaystyle f</math> and the constraint curve <math>\displaystyle g = 0</math> touch but do not cross. At this point the tangents of the two curves are parallel, or equivalently the gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel:<br />
<br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where <math>\displaystyle \nabla_{x,y} f = \left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)</math><br />
<br><br />
and <math>\displaystyle \nabla_{x,y} g = \left(\frac{\partial g}{\partial x},\frac{\partial g}{\partial y}\right)</math><br />
<br><br />
To incorporate these into one equation, we define L as <math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math>.<br />
<br />
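As a small worked example (the functions here are chosen only for illustration and are not from the lecture), maximize <math>\displaystyle f(x,y) = x + y</math> subject to <math>\displaystyle g(x,y) = x^2 + y^2 - 1 = 0</math>. The Lagrangian is<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x + y - \lambda (x^2 + y^2 - 1)</math><br />
<br />
Setting the partial derivatives to zero gives<br />
<br />
<math>\displaystyle \frac{\partial L}{\partial x} = 1 - 2\lambda x = 0, \quad \frac{\partial L}{\partial y} = 1 - 2\lambda y = 0, \quad \frac{\partial L}{\partial \lambda} = -(x^2 + y^2 - 1) = 0</math><br />
<br />
The first two equations give <math>\displaystyle x = y = \frac{1}{2\lambda}</math>, and substituting into the constraint gives <math>\displaystyle \lambda = \pm \frac{1}{\sqrt{2}}</math>. Taking <math>\displaystyle \lambda^* = \frac{1}{\sqrt{2}}</math> yields the maximum at <math>\displaystyle (x^*,y^*) = \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)</math> with <math>\displaystyle f = \sqrt{2}</math>, while the negative root yields the minimum. At the maximum, <math>\displaystyle \nabla f = (1,1)</math> and <math>\displaystyle \lambda^* \nabla g = \frac{1}{\sqrt{2}} (\sqrt{2}, \sqrt{2}) = (1,1)</math>, so the two gradients are parallel.<br />
<br />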
<br><br />
Back to the original problem, from the Lagrangian we obtain <math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T s \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br />
(Note that <math> \textbf{w}^T s \textbf{w} </math> is a quadratic function of '''w'''. Because s is symmetric, its derivative with respect to '''w''' is '''2sw''', which appears below.)<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2s\textbf{w} - 2\lambda\textbf{w} = 0 </math><br />
<br><br />
<math>\displaystyle s\textbf{w} = \lambda\textbf{w} </math><br />
<br><br />
This equation means that <math>\textbf{w}</math> is an eigenvector of s and <math>\lambda</math> is an eigenvalue of s.<br />
<br><br />
Substituting <math>\displaystyle s\textbf{w} = \lambda\textbf{w}</math> into the objective <math>\displaystyle \textbf{w}^T s\textbf{w}</math>, we obtain <math>\displaystyle\textbf{w}^T s\textbf{w} = \textbf{w}^T \lambda \textbf{w} = \lambda \textbf{w}^T \textbf{w} = \lambda </math><br />
<br><br />
In order to maximize the objective function we need to choose the eigenvector with the largest eigenvalue.<br />
<br />
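One simple way to find the eigenvector with the largest eigenvalue is power iteration, sketched here in plain Python. Power iteration is not the method used in the lecture (the class example uses Matlab's <code>svd</code>), and the matrix <code>s</code> is a hypothetical covariance matrix chosen for illustration.<br />
<br />
```python
import math

# Hypothetical symmetric sample covariance matrix s.
s = [
    [4.0, 2.0, 0.6],
    [2.0, 3.0, 0.4],
    [0.6, 0.4, 1.0],
]

def matvec(m, v):
    """Matrix-vector product m v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def normalize(v):
    """Rescale v to unit length, so that w^T w = 1."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def first_principal_component(m, iterations=500):
    """Power iteration: repeatedly apply m and rescale to unit length."""
    w = normalize([1.0] * len(m))
    for _ in range(iterations):
        w = normalize(matvec(m, w))
    lam = sum(wi * ci for wi, ci in zip(w, matvec(m, w)))  # w^T s w
    return lam, w

lam, w = first_principal_component(s)

# Largest entry of s w - lambda w; it is ~0 when w is an eigenvector.
residual = max(abs(a - lam * b) for a, b in zip(matvec(s, w), w))
print("largest eigenvalue:", lam)
print("first PC          :", w)
print("residual          :", residual)
```
<br />
The value <code>lam</code> is both the eigenvalue and the variance <math>\textbf{w}^T s \textbf{w}</math> along the first principal component, so no coordinate axis can have a larger variance than it.<br />
<br />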
We choose the first PC, '''u1''', to have the maximum variance (i.e. to capture as much of the variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible).<br />
<br />
Subsequent principal components account for successively smaller parts of the total variability.<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br />
i.e. the sum of the variances along all principal directions equals the total variance of the data.<br />
<br />
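A quick numeric check of this decomposition for a 2 x 2 case, in plain Python. The matrix <code>S</code> is a hypothetical sample covariance matrix, and the closed-form eigenvalue formula used here is specific to symmetric 2 x 2 matrices.<br />
<br />
```python
import math

# Hypothetical 2x2 sample covariance matrix S (symmetric).
S = [[3.0, 1.2],
     [1.2, 2.0]]

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (S[0][0] + S[1][1]) / 2
radius = math.hypot((S[0][0] - S[1][1]) / 2, S[0][1])
lam1, lam2 = half_trace + radius, half_trace - radius

total_variance = S[0][0] + S[1][1]          # Tr(S) = Var(x_1) + Var(x_2)
explained_by_first = lam1 / total_variance  # share captured by the first PC

print(lam1 + lam2, total_variance)
print(explained_by_first)
```
<br />
Here the eigenvalues are about 3.8 and 1.2, so the first principal component alone accounts for roughly 76% of the total variance of 5.<br />
<br />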
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, assuming that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
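 % Load the noisy images; each column of X holds one image of 20*28 = 560 pixels<br />
 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
 idx = 1;  % index of the image to display (any column of X)<br />
 imagesc(reshape(X(:,idx),20,28)')  % original noisy image<br />
 colormap gray<br />
 % Keep only the top 10 singular vectors (rank-10 approximation of X)<br />
 [u, s, v] = svd(X, 'econ');<br />
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';<br />
 figure<br />
 imagesc(reshape(xHat(:,idx),20,28)')  % reconstruction of the same image<br />
 colormap gray<br />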
<br />
Running the above code gives us two images. The first shows the noisy data, in which we can barely make out the face.<br />
<br />
The second is the denoised (reconstructed) image.<br />
<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (images, etc.). There are generally two ways to do this. We can classify the data (i.e. give each data point a label and compare the different types of data) or cluster it (i.e. leave the data unlabeled and look for groups in the output).<br />
<br />
Generally speaking, we can do this with the entire data set (if we have an 8 x 8 picture, we can use all 64 pixels). However, working with every dimension is hard, and it is easier to use the reduced data and its features.<br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s from the same data set we have been using throughout the course. We will apply PCA to the data and compare the images of the labeled data. This is an example of classification.<br />
<br />
MATLAB CODE<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
1. The mean of the data (X) must be 0, so we may have to preprocess the data by subtracting the mean.<br />
2. Encoding the data means projecting it onto a lower-dimensional subspace by taking inner products. <math>U^T X </math> is a (d x n) matrix.<br />
3. When we reconstruct the training set, we use only the top d dimensions. This eliminates the dimensions that have lower variance (e.g. noise).<br />
4. We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
<br />
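The steps in these notes (center, find the top d directions, encode, reconstruct) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code from the slides: the function <code>pca</code>, the helper <code>dot</code>, and the data <code>X</code> are hypothetical, and the eigenvectors are found by power iteration with deflation rather than by a library routine.<br />
<br />
```python
import math

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def pca(observations, d, iterations=200):
    """Return (mean, components, codes, reconstruction) using the top d components.

    observations: list of n vectors of length D (the columns of X in the notes).
    """
    n, dim = len(observations), len(observations[0])

    # Step 1: subtract the mean so that the data has mean 0.
    mean = [sum(x[j] for x in observations) / n for j in range(dim)]
    centered = [[x[j] - mean[j] for j in range(dim)] for x in observations]

    # Step 2: sample covariance matrix of the centered data.
    cov = [[sum(x[i] * x[j] for x in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]

    # Step 3: top d eigenvectors by power iteration with deflation.
    # Assumes every retained direction has nonzero variance.
    components = []
    for _ in range(d):
        w = [float(j + 1) for j in range(dim)]  # arbitrary nonzero start vector
        for _step in range(iterations):
            w = [dot(row, w) for row in cov]
            norm = math.sqrt(dot(w, w))
            w = [c / norm for c in w]
        lam = dot(w, [dot(row, w) for row in cov])
        components.append(w)
        # Remove the direction just found so the next pass finds the next one.
        cov = [[cov[i][j] - lam * w[i] * w[j] for j in range(dim)] for i in range(dim)]

    # Step 4: encode each observation as U^T x (d numbers per observation).
    codes = [[dot(u, x) for u in components] for x in centered]

    # Step 5: reconstruct from the codes and add the mean back.
    reconstruction = [[mean[j] + sum(c * u[j] for c, u in zip(code, components))
                       for j in range(dim)] for code in codes]
    return mean, components, codes, reconstruction

# Hypothetical data: four points in 3-D that lie exactly on a 2-D plane.
X = [[3.0, 3.0, 4.0], [3.0, 3.0, 6.0], [7.0, 7.0, 4.0], [7.0, 7.0, 6.0]]

_, _, codes2, recon2 = pca(X, d=2)   # two components: reconstruction is exact
_, _, codes1, recon1 = pca(X, d=1)   # one component: the low-variance direction is lost
error2 = max(abs(a - b) for x, r in zip(X, recon2) for a, b in zip(x, r))
error1 = max(abs(a - b) for x, r in zip(X, recon1) for a, b in zip(x, r))
print(error2, error1)
```
<br />
With two components the made-up data is reconstructed up to rounding error, while with one component the direction of smaller variance (the third coordinate) is discarded, as described in note 3.<br />
<br />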
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
Similar to PCA, the goal of FDA is to project the data onto a lower-dimensional space. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no glasses).<br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes. For a 2 class problem, we want to reduce the data to one dimension (a line). Generally, for a k class problem, we want k-1 dimensions.<br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, ideally the individual classes are as far away from each other as possible, while the points within each class are close together (i.e. each class collapses to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
<b> Goal </b><br />
<br />
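<b>1. Minimize the within class variance</b><br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
Here <math>\Sigma_1</math> and <math>\Sigma_2</math> are the covariance matrices of the two classes. We combine both objectives by minimizing their sum: <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />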
<br />
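To make the within-class objective concrete, the sketch below projects two small classes onto a direction w and checks that the sum of the projected variances equals the quadratic form in the summed covariance matrices. It is plain Python, and the class points and the candidate directions are made-up values, not data from the lecture.<br />
<br />
```python
# Two hypothetical 2-D classes (made-up points).
class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5), (2.0, 1.5)]
class2 = [(6.0, 5.0), (7.0, 7.0), (8.0, 6.5), (7.0, 5.5)]

def variance(values):
    """Sample variance of a list of scalars."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)

def covariance(points):
    """Sample covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def quadratic_form(w, m):
    """w^T m w for a 2-vector w and a 2x2 matrix m."""
    return sum(w[i] * m[i][j] * w[j] for i in range(2) for j in range(2))

def within_class_variance(w):
    """Sum of the variances of the two classes after projecting onto w."""
    return sum(variance([w[0] * x + w[1] * y for x, y in cls])
               for cls in (class1, class2))

sigma1, sigma2 = covariance(class1), covariance(class2)
sigma_sum = [[sigma1[i][j] + sigma2[i][j] for j in range(2)] for i in range(2)]

w_a = (0.6, 0.8)    # one unit-length candidate direction
w_b = (0.8, -0.6)   # the perpendicular unit-length direction

for w in (w_a, w_b):
    # The two printed values agree: projected variance vs. w^T (Sigma_1 + Sigma_2) w
    print(w, within_class_variance(w), quadratic_form(w, sigma_sum))
```
<br />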
<br />
<b>2. Maximize the distance between the means of the projected data</b><br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar and therefore equals its own trace:<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the cyclic property of the trace, <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
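The chain of equalities can be checked numerically. The plain Python sketch below uses hypothetical class means and two made-up unit directions; it compares the squared gap between the projected means with the quadratic form in the outer product of the mean difference.<br />
<br />
```python
# Hypothetical class means in 2-D (made-up numbers).
mu1 = (2.0, 2.5)
mu2 = (7.0, 6.0)

diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])
# Between-class matrix (mu1 - mu2)(mu1 - mu2)^T, an outer product.
s_b = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

def projected_mean_gap(w):
    """(w^T mu1 - w^T mu2)^2, computed directly from the projected means."""
    return ((w[0] * mu1[0] + w[1] * mu1[1]) - (w[0] * mu2[0] + w[1] * mu2[1])) ** 2

def quadratic_form(w, m):
    """w^T m w for a 2-vector w and a 2x2 matrix m."""
    return sum(w[i] * m[i][j] * w[j] for i in range(2) for j in range(2))

w_a = (0.6, 0.8)    # unit-length direction roughly aligned with mu1 - mu2
w_b = (0.8, -0.6)   # the perpendicular unit-length direction

for w in (w_a, w_b):
    # Both expressions give the same value for every direction w.
    print(w, projected_mean_gap(w), quadratic_form(w, s_b))
```
<br />
The direction that is roughly aligned with the difference of the means gives a much larger value than the perpendicular one, which is the behaviour this objective rewards.<br />
<br />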
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math></div>Hclam